Spectral Clustering Based on Local PCA 
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We propose a spectral clustering method based on local principal components analysis (PCA). After 
performing local PCA in selected neighborhoods, the algorithm builds a nearest neighbor graph 
weighted according to a discrepancy between the principal subspaces in the neighborhoods, and 
then applies spectral clustering. As opposed to standard spectral methods based solely on pairwise 
distances between points, our algorithm is able to resolve intersections. We establish theoretical 
guarantees for simpler variants within a prototypical mathematical framework for multi-manifold 
clustering, and evaluate our algorithm on various simulated data sets. 
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1 Introduction 

The task of multi-manifold clustering, where the data are assumed to be located near surfaces 
embedded in Euclidean space, is relevant in a variety of applications. In cosmology, it arises as 
the extraction of galaxy clusters in the form of filaments (curves) and walls (surfaces) (Martinez 
and Saar, 2002; Valdarnini, 2001); in motion segmentation, moving objects tracked along different 
views form affine or algebraic surfaces (Chen et al., 2009; Fu et al., 2005; Ma et al., 2008; Vidal and 
Ma, 2006) ; this is also true in face recognition, in the context of images of faces in fixed pose under 
varying illumination conditions (Basri and Jacobs, 2003; Epstein et al., 1995; Ho et al., 2003). 

We consider a stylized setting where the underlying surfaces are nonparametric in nature, with 
a particular emphasis on situations where the surfaces intersect. Specifically, we assume the sur- 
faces are smooth, for otherwise the notion of continuation is potentially ill-posed. For example, 
without smoothness assumptions, an L-shaped cluster is indistinguishable from the union of two 
line-segments meeting at right angle. 

Spectral methods (Luxburg, 2007) are particularly suited for nonparametric settings, where the 
underlying clusters are usually far from convex, making standard methods like K-means irrelevant. 
However, a drawback of standard spectral approaches such as the well-known variation of Ng, 
Jordan, and Weiss (2002) is their inability to separate intersecting clusters. Indeed, consider the 
simplest situation where two straight clusters intersect at right angle, pictured in Figure 1 below. 
The algorithm of Ng et al. (2002) is based on pairwise affinities that are decreasing in the distances 
between data points, making it insensitive to smoothness and, therefore, intersections. And indeed, 
this algorithm typically fails to separate intersecting clusters, even in the easiest setting of Figure 1. 

As argued in (Agarwal et al., 2006, 2005; Shashua et al., 2006), a multiway affinity is needed to 
capture complex structure in data (here, smoothness) beyond proximity attributes. For example, 
Chen and Lerman (2009b) use a flatness affinity in the context of hybrid linear modeling, where 
the surfaces are assumed to be affine subspaces, and subsequently extended to algebraic surfaces 
via the 'kernel trick' (Chen, Atev, and Lerman, 2009). Moving beyond parametric models, Arias- 
Castro, Chen, and Lerman (2011) consider a localized measure of flatness; see also Elhamifar and 
Vidal (2011). Continuing this line of work, we suggest a spectral clustering method based on the 
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Figure 1: Two rectangular clusters intersecting at right angle. Left: the original data. Center: a 
typical output of the standard spectral clustering method of Ng et al. (2002), which is generally 
unable to resolve intersections. Right: our method. 



estimation of the local linear structure (tangent bundle) via local principal component analysis 
(PCA). 

The idea of using local PCA combined with spectral clustering has precedents in the literature. 
In particular, our method is inspired by the work of Goldberg, Zhu, Singh, Xu, and Nowak (2009), 
where the authors develop a spectral clustering method within a semi-supervised learning frame- 
work. Local PCA is also used in the multiscale, spectral-flavored algorithm of Kushnir, Galun, and 
Brandt (2006). This approach is in the Zeitgeist. While writing this paper, we became aware of two 
very recent publications, by Wang, Jiang, Wu, and Zhou (2011) and by Gong, Zhao, and Medioni 
(2012), both proposing approaches very similar to ours. We comment on these spectral methods 
in more detail later on. 

The basic proposition of local PCA combined with spectral clustering has two main stages. 
The first one forms an affinity between a pair of data points that takes into account both their 
Euclidean distance and a measure of discrepancy between their tangent spaces. Each tangent space 
is estimated by PCA in a local neighborhood around each point. The second stage applies standard 
spectral clustering with this affinity. As a reality check, this relatively simple algorithm succeeds 
at separating the straight clusters in Figure 1. We tested our algorithm in more elaborate settings, 
some of them described in Section 4. 

Besides spectral-type approaches to multi-manifold clustering, other methods appear in the 
literature. The methods we know of either assume that the different surfaces do not intersect 
(Polito and Perona, 2001), or that the intersecting surfaces have different intrinsic dimension or 
density (Gionis et al., 2005; Haro et al., 2007). The few exceptions tend to propose very complex 
methods that promise to be challenging to analyze (Guo et al., 2007; Souvenir and Pless, 2005). 

Our contribution is the design and detailed study of a prototypical spectral clustering algorithm 
based on local PCA, tailored to settings where the underlying clusters come from sampling in the 
vicinity of smooth surfaces that may intersect. We endeavored to simplify the algorithm as much as 
possible without sacrificing performance. We provide theoretical results for simpler variants within 
a standard mathematical framework for multi-manifold clustering. To our knowledge, these are 
the first mathematically backed successes at the task of resolving intersections in the context of 
multi-manifold clustering, with the exception of (Arias-Castro et al., 2011), where the correspond- 
ing algorithm is shown to succeed at identifying intersecting curves. The salient features of that 
algorithm are illustrated via numerical experiments. 

The rest of the paper is organized as follows. In Section 2, we introduce our methods. In Sec- 
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tion 3, we analyze our methods in a standard mathematical framework for multi-manifold learning. 
In Section 4, we perform some numerical experiments illustrating several features of our algorithm. 
In Section 5, we discuss possible extensions. 

2 The methodology 

We introduce our algorithm and simpler variants that are later analyzed in a mathematical frame- 
work. We start with some review of the literature, zooming in on the most closely related publica- 
tions. 

2.1 Some precedents 

Using local PCA within a spectral clustering algorithm was implemented in four other publications 
we know of (Goldberg et al., 2009; Gong et al., 2012; Kushnir et al., 2006; Wang et al., 2011). 
As a first stage in their semi-supervised learning method, Goldberg, Zhu, Singh, Xu, and Nowak 
(2009) design a spectral clustering algorithm. The method starts by subsampling the data points, 
obtaining 'centers' in the following way. Draw y l at random from the data and remove its ^-nearest 
neighbors from the data. Then repeat with the remaining data, obtaining centers yi,y 2 , ■ ■ ■ ■ Let 
Cj denote the sample covariance in the neighborhood of y i made of its ^-nearest neighbors. An 
m-nearest-neighbor graph is then defined on the centers in terms of the Mahalanobis distances. 
Explicitly, the centers y i and yj are connected in the graph if yj is among the m nearest neighbors 
of y i in Mahalanobis distance 

\\c; 1/2 ( yi - yj )\\, (i) 

or vice- versa. The parameters I and m are both chosen of order logn. An existing edge between 
j/j and yj is then weighted by exp(— iT^/77 2 ), where Hij denotes the Hellinger distance between 
the probability distributions JV(0, Cj) and Af(0, Cj). The spectral graph partitioning algorithm of 
Ng, Jordan, and Weiss (2002) — detailed in Algorithm 1 — is then applied to the resulting affinity 
matrix, with some form of constrained K-means. We note that Goldberg et al. (2009) evaluate their 
method in the context of semi-supervised learning where the clustering routine is only required to 
return subclusters of actual clusters. In particular, the data points other than the centers are 
discarded. Note also that their evaluation is empirical. 

Algorithm 1 Spectral Graph Partitioning (Ng, Jordan, and Weiss, 2002) 
Input: 

Affinity matrix W = (Wij), size of the partition K 
Steps: 

1: Compute Z = (Zij) according to Zij = Wij/y/DiDj, with Di = Y^j=i Wij- 

2: Extract the top K eigenvectors of Z. 

3: Renormalize each row of the resulting n x K matrix. 

4: Apply i^-means to the row vectors. 



The algorithm proposed by Kushnir, Galun, and Brandt (2006) is multiscale and works by 
coarsening the neighborhood graph and computing sampling density and geometric information 
inferred along the way such as obtained via PCA in local neighborhoods. This bottom-up flow 
is then followed by a top-down pass, and the two are iterated a few times. The algorithm is too 
complex to be described in detail here, and probably too complex to be analyzed mathematically. 
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The clustering methods of Goldberg et al. (2009) and ours can be seen as simpler variants that 
only go bottom up and coarsen the graph only once. 

In the last stages of writing this paper, we learned of the works of Wang, Jiang, Wu, and 
Zhou (2011) and Gong, Zhao, and Medioni (2012), who propose algorithms very similar to our 
Algorithm 3 detailed below. Note that these publications do not provide any theoretical guarantees 
for their methods, which is one of our main contributions here. 

2.2 Our algorithms 

We now describe our method and propose several variants. Our setting is standard: we observe 
data points x\, . . . ,x n £ ~K D that we assume were sampled in the vicinity of K smooth surfaces 
embedded in K . . The setting is formalized later in Section 3.1. 

2.2.1 Connected component extraction: comparing local covariances 

We start with our simplest variant, which is also the most natural. The method depends on a 
neighborhood radius r > 0, a spatial scale parameter e > and a covariance (relative) scale r\ > 0. 
For a vector x, \\x\\ denotes its Euclidean norm, and for a (square) matrix A, \\A\\ denotes its 
spectral norm. For n £ N, we denote by [n] the set {1, . . . , n}. Given a data set x\, . . . , x n , for any 
point x E ~R D and r > 0, define the neighborhood 

N r (x) = {xj : \\x — Xj\\ < r}. (2) 



Algorithm 2 Connected Component Extraction: Comparing Covariances 
Input: 

Data points x%, . . . , x n ; neighborhood radius r > 0; spatial scale e > 0, covariance scale rj > 0. 
Steps: 

1: For each i £ [n], compute the sample covariance matrix C% of N r (xi). 
2: Compute the following affinities between data points: 

W ij = ^{Ijaji-asjll^e} " ^-{WCi-C^Kr,^}- (3) 

3: Remove Xi when there is Xj such that \\xj — Xi\\ < r and \\Cj — Ci\\ > rjr 2 . 
4: Extract the connected components of the resulting graph. 

5: Points removed in Step 3 are grouped with the closest point that survived Step 3. 



In summary, the algorithm first creates an unweighted graph: the nodes of this graph are the 
data points and edges are formed between two nodes if both the distance between these nodes and 
the distance between the local covariance structures at these nodes are sufficiently small. After 
removing the points near the intersection at Step 3, the algorithm then extracts the connected 
components of the graph. 

In principle, the neighborhood size r is chosen just large enough that performing PCA in each 
neighborhood yields a reliable estimate of the local covariance structure. For this, the number of 
points inside the neighborhood needs to be large enough, which depends on the sample size n, the 
sampling density, intrinsic dimension of the surfaces and their surface area (Hausdorff measure), 
how far the points are from the surfaces (i.e., noise level), and the regularity of the surfaces. The 
spatial scale parameter e depends on the sampling density and r. It needs to be large enough that a 
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point has plenty of points within distance e, including some across an intersection, so each cluster is 
strongly connected. At the same time, e needs to be small enough that a local linear approximation 
to the surfaces is a relevant feature of proximity. Its choice is rather similar to the choice of the 
scale parameter in standard spectral clustering (Ng et al., 2002; Zelnik-Manor and Perona, 2004). 
The orientation scale r\ needs to be large enough that centers from the same cluster and within 
distance e of each other have local covariance matrices within distance r\r 2 , but small enough that 
points from different clusters near their intersection have local covariance matrices separated by 
a distance substantially larger than ijr 2 . This depends on the curvature of the surfaces and the 
incidence angle at the intersection of two (or more) surfaces. Note that a typical covariance matrix 
over a ball of radius r has norm of order r 2 , which justifies using our choice of parametrization. In 
the mathematical framework we introduce later on, these parameters can be chosen automatically 
as done in (Arias-Castro et al., 2011), at least when the points are sampled exactly on the surfaces. 
We will not elaborate on that since in practice this does not inform our choice of parameters. 

The rationale behind Step 3 is as follows. As we just discussed, the parameters need to be 
tuned so that points from the same cluster and within distance e have local covariance matrices 
within distance ryr 2 . Hence, Xi and Xj in Step 3 are necessarily from different clusters. Since they 
are near each other, in our model this will imply that they are close to an intersection. Therefore, 
roughly speaking, Step 3 removes points near an intersection. 

Although this method works in simple situations like that of two intersecting segments (Fig- 
ure 1), it is not meant to be practical. Indeed, extracting connected components is known to be 
sensitive to spurious points and therefore unstable. Furthermore, we found that comparing local 
covariance matrices as in affinity (3) tends to be less stable than comparing local projections as in 
affinity (4), which brings us to our next variant. 

2.2.2 Connected component extraction: comparing local projections 

We present another variant also based on extracting the connected components of a neighborhood 
graph that compares orthogonal projections onto the largest principal directions. 



Algorithm 3 Connected Component Extraction: Comparing Projections 
Input: 

Data points x\, . . . , x n ; neighborhood radius r > 0, spatial scale e > 0, projection scale 77 > 0. 
Steps: 

1: For each i G [n], compute the sample covariance matrix Cj of N r (xi). 

2: Compute the projection Q i onto the eigenvectors of Cj with eigenvalue exceeding ||Cj||. 
3: Compute the following affinities between data points: 

W H = • ^WQi-Q^n}- ( 4 ) 

4: Extract the connected components of the resulting graph. 



We note that the local intrinsic dimension is determined by thresholding the eigenvalues of the 
local covariance matrix, keeping the directions with eigenvalues within some range of the largest 
eigenvalue. The same strategy is used by Kushnir et al. (2006), but with a different threshold. The 
method is a hard version of what we implemented, which we describe next. 
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2.2.3 Covariances or projections? 



In our numerical experiments, we tried working both directly with covariance matrices as in (3) and 
with projections as in (4). Note that in our experiments we used spectral graph partitioning with 
soft versions of these affinities, as described in Section 2.2.4. We found working with projections to 
be more reliable. The problem comes, in part, from boundaries. When a surface has a boundary, 
local covariances over neighborhoods that overlap with the boundary are quite different from local 
covariances over nearby neighborhoods that do not touch the boundary. Consider the example of 
two segments, S\ and 52, intersecting at an angle of 9 G (0, 7r/2) at their middle point, specifically 

S 1 = [-1, 1] x {0}, 5*2 = {(x,xtan<9) : x G [- cos 9, cos 9]}. 

Assume there is no noise and that the sampling is uniform. Assume r G (0, \ sin#) so that the disc 
centered at x\ := (1/2,0) does not intersect S2, and the disc centered at X2 := cos#, \ tan6>) 
does not intersect Si. Let xq = (1,0). For x G 5i U 52, let C x denote the local covariance at x 
over a ball of radius r. Simple calculations yield: 

_ r 2 (I 0\ _r 2 /l 0\ _ r 2 ( cos 2 9 sin(0) cos(0) 



~ 12 \0 Oj ' 331 ~ 3 V° 0/ ' 2 ~ 3 \ S m(9)co S (9) sin 2 
and therefore 

r 2 \f 7 lr 2 

||Ca;o — Ctci || = || Ctci — C X2 1| = - sin0. 

When sm.9 < (roughly, 9 < 32°), the difference in Frobenius norm between the local covariances 
at xq, x\ G Si is larger than that at x\ G S\ and X2 G 52- As for projections, however, 

(1(1 ( l °\ (1 ( cos2 ° sin(6») cos((9) 
Q xo ~ Q X1 ~ qJ , Q X2 ~ y s[n ^ cos (^) sin 2 Q 

so that 

UQa* - Q X1 II = 0, \\Q X1 - Q X2 \\ = V2sin9. 

While in theory the boundary points account for a small portion of the sample, in practice this 
is not the case and we find that spectral graph partitioning is challenged by having points near the 
boundary that are far (in affinity) from nearby points from the same cluster. This may explain 
why the (soft version of) affinity (4) yields better results than the (soft version of) affinity (3) in 
our experiments. 



2.2.4 Spectral Clustering Based on Local PCA 

The following variant is more robust in practice and is the algorithm we actually implemented. 
The method assumes that the surfaces are of same dimension d and that they are K surfaces, with 
both parameters K and d known. 

We note that y 1 ,... ,y no forms an r-packing of the data. The underlying rationale for this 
coarsening is justified in (Goldberg et al., 2009) by the fact that the covariance matrices, and also 
the top principal directions, change smoothly with the location of the neighborhood, so that without 
subsampling these characteristics would not help detect the abrupt event of an intersection. The 
affinity (5) is of course a soft version of (4). 



6 



Algorithm 4 Spectral Clustering Based on Local PCA 



Input: 

Data points x%, . . . , x n ; neighborhood radius r > 0; spatial scale e > 0, projection scale r\ > 0; 
intrinsic dimension <i; number of clusters K. 

Steps: 

0: Pick one point y± at random from the data. Pick another point y 2 among the data points 
not included in A r r (y 1 ), and repeat the process, selecting centers y 1 , . . . ,y no - 

1: For each i = 1, . . . , no, compute the sample covariance matrix C7, of N r (y i ). Let Q i denote 
the orthogonal projection onto the space spanned by the top d eigenvectors of Cj. 

2: Compute the following affinities between center pairs: 

^e*p(-M^. e ^_WiZ^f). (5) 
3: Apply spectral graph partitioning (Algorithm 1) to W. 

4: The data points are clustered according to the closest center in Euclidean distance. 



2.2.5 Comparison with closely related methods 

We highlight some differences with the other proposals in the literature. We first compare our 
approach to that of Goldberg et al. (2009), which was our main inspiration. 

• Neighborhoods. Comparing with Goldberg et al. (2009), we define neighborhoods over r- 
balls instead of ^-nearest neighbors, and connect points over e-balls instead of m-nearest 
neighbors. This choice is for convenience, as these ways are in fact essentially equivalent 
when the sampling density is fairly uniform. This is elaborated at length in (Arias-Castro, 
2011; Brito et al, 1997; Maier et al., 2009). 

• Mahalanobis distances. Goldberg et al. (2009) use Mahalanobis distances (1) between centers. 
In our version, we could for example replace the Euclidean distance \\xi — Xj\\ in the affinity 
(3) with the average Mahalanobis distance 

Iicr^^-^II + IICT 1 / 2 ^.-^)!!. ( 6 ) 

We actually tried this and found that the algorithm was less stable, particularly under low 
noise. Introducing a regularization in this distance — which requires the introduction of 
another parameter — solves this problem partially. 

That said, using Mahalanobis distances makes the procedure less sensitive to the choice of e, 
in that neighborhoods may include points from different clusters. Think of two parallel line 
segments separated by a distance of 5, and assume there is no noise, so the points are sampled 
exactly from these segments. Assuming an infinite sample size, the local covariance is the 
same everywhere so that points within distance e are connected by the affinity (3). Hence, 
Algorithm 2 requires that e < 5. In terms of Mahalanobis distances, points on different 
segments are infinitely separated, so a version based on these distances would work with any 
e > 0. In the case of curved surfaces and/or noise, the situation is similar, though not as 
evident. Even then, the gain in performance guarantees is not obvious, since we only require 
that e be slightly larger in order of magnitude that r. 
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• Hettinger distances. As we mentioned earlier, Goldberg et al. (2009) use Hellinger distances of 
the probability distributions JV(0, Cj) and A/"(0, Cj) to compare covariance matrices, specif- 
ically 

/ oD/2 detjCiCj)^ \ 1/2 

y 1 det(C, + C,)V2j ' ^ 

if Cj and Cj are full-rank. While using these distances or the Frobenius distances makes 
little difference in practice, we find it easier to work with the latter when it comes to prov- 
ing theoretical guarantees. Moreover, it seems more natural to assume a uniform sampling 
distribution in each neighborhood rather than a normal distribution, so that using the more 
sophisticated similarity (7) does not seem justified. 

• K-means. We use K-means-l— |- for a good initialization. However, we found that the more 
sophisticated size-constrained K-means (Bradley et al., 2000) used in (Goldberg et al., 2009) 
did not improve the clustering results. 

As we mentioned above, our work was developed in parallel to that of Wang et al. (2011) and 
Gong et al. (2012). We highlight some differences. They do not subsample, but estimate the local 
tangent space at each data point £Cj. Wang et al. (2011) fit a mixture of d-dimensional affine 
subspaces to the data using MPPCA (Tipping and Bishop, 1999), which is then used to estimate 
the tangent subspaces at each data point. Gong et al. (2012) develop some sort of robust local 
PCA. While Wang et al. (2011) assume all surfaces are of same dimension known to the user, Gong 
et al. (2012) estimate that locally by looking at the largest gap in the spectrum of estimated local 
covariance matrix. This is similar in spirit to what is done in Step 2 of Algorithm 3, but we did 
not include this step in Algorithm 4 because we did not find it reliable in practice. We also tried 
estimating the local dimensionality using the method of Little et al. (2009), but this failed in the 
most complex cases. 

Wang et al. (2011) use a nearest-neighbor graph and their affinity is defined as 




W ij = A ir y_|coB0.(i,j) I , (8) 

where A^ = 1 if X{ is among the I- nearest neighbors of Xj, or vice versa, while Ajj = otherwise; 
&i(i,j) > ••• > Gd(i,j) are the principal (aka, canonical) angles (Stewart and Sun, 1990) between 
the estimated tangent subspaces at Xi and Xj. I and a are parameters of the method. Gong et al. 
(2012) define an affinity that incorporates the self-tuning method of Zelnik-Manor and Perona 
(2004); in our notation, their affinity is 

/ \\x i -x j \\ 2 \ ( asin 2 ||Qj - Q „ , 
exp J — • exp no,/ \ • (9) 




I i 211 112//' 



where £j is the distance from Xi to its ^-nearest neighbor. I is a parameter. 

Although we do not analyze their respective ways of estimating the tangent subspaces, our 
analysis provides essential insights into their methods, and for that matter, any other method built 
on spectral clustering based on tangent subspace comparisons. 



3 Mathematical Analysis 

While the analysis of Algorithm 4 seems within reach, there are some complications due to the fact 
that points near the intersection may form a cluster of their own — we were not able to discard this 
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possibility. Instead, we study the simpler variants described in Algorithm 2 and Algorithm 3. Even 
then, the arguments are rather complex and interestingly involved. The theoretical guarantees that 
we thus obtain for these variants are stated in Theorem 1 and proved in Section 6. We comment 
on the analysis of Algorithm 4 right after that. We note that there are very few theoretical results 
on resolving intersecting clusters. In fact, we are only aware of (Chen and Lerman, 2009a) in 
the context of affine surfaces, (Soltanolkotabi and Candes, 2011) in the context of affine surfaces 
without noise and (Arias-Castro et al., 2011) in the context of curves. 

The generative model we assume is a natural mathematical framework for multi-manifold learn- 
ing where points are sampled in the vicinity of smooth surfaces embedded in Euclidean space. For 
concreteness and ease of exposition, we focus on the situation where two surfaces (i.e., K = 2) 
of same dimension 1 < d < D intersect. This special situation already contains all the geometric 
intricacies of separating intersecting clusters. On the one hand, clusters of different intrinsic dimen- 
sion may be separated with an accurate estimation of the local intrinsic dimension without further 
geometry involved (Haro et al., 2007). On the other hand, more complex intersections (3- way and 
higher) complicate the situation without offering truly new challenges. For simplicity of exposition, 
we assume that the surfaces are submanifolds without boundary, though it will be clear from the 
analysis (and the experiments) that the method can handle surfaces with (smooth) boundaries that 
may self- intersect. We discuss other possible extensions in Section 5. 

Within that framework, we show that Algorithm 2 and Algorithm 3 are able to identify the 
clusters accurately except for points near the intersection. Specifically, with high probability with 
respect to the sampling distribution, Algorithm 2 divides the data points into two groups such 
that, except for points within distance Ce of the intersection, all points from the first cluster are in 
one group and all points from the second cluster are in the other group. The constant C depends 
on the surfaces, including their curvatures, separation between them and intersection angle. The 
situation for Algorithm 3 is more complex, as it may return more than two clusters, but the main 
feature is that most of two clusters (again, away from the intersection) are in separate connected 
components. 

3.1 Generative model 

Each surface we consider is a connected, C 2 and compact submanifold without boundary and 
of dimension d embedded in ~S, D '. Any such surface has a positive reach, which is what we use 
to quantify smoothness. The notion of reach was introduced by Federer (1959). Intuitively, a 
surface has reach exceeding r if, and only if, one can roll a ball of radius r on the surface without 
obstruction (Walther, 1997). Formally, for x G M D and Scl°, let 

dist(a:, S) = inf \\x — 

and 

B(S, r) = {x : dist{x,S) < r}, 

which is often called the r-tubular neighborhood (or r-neighborhood) of S. The reach of S is the 
supremum over r > such that, for each x 6 B(S,r), there is a unique point in S nearest x. It is 
well-known that, for C 2 submanifolds, the reach bounds the radius of curvature from below (Federer, 
1959, Lem. 4.17). For submanifolds without boundaries, the reach coincides with the condition 
number introduced in (Niyogi et al., 2008). 

When two surfaces Si and S 2 intersect, meaning Si n S 2 7^ 0, we define their incidence angle as 

9(Si,S 2 ) := m{(e min (T Sl (s),T S2 (s)) :aeS 1 nS 2 ), (10) 
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where T$(s) denote the tangent subspace of submanifold S at point s £ S, and 6 m m(Ti,T2) is the 
smallest nonzero principal (aka, canonical) angle between subspaces T\ and T2 (Stewart and Sun, 
1990). 

The clusters are generated as follows. Each data point Xi is drawn according to 

Xi = Si + Zi, (11) 

where Sj is drawn from the uniform distribution over S\ U S2 and Z\ is an additive noise term 
satisfying \z{\ < r — thus r represents the noise or jitter level, and r = means that the points 
are sampled on the surfaces. We assume the points are sampled independently of each other. We 
let 

I k = {% : Si € S k }, (12) 
and the goal is to recover the groups I\ and I2, up to some errors. 

3.2 Performance guarantees 

We state some performance guarantees for Algorithm 2 and Algorithm 3. 

Theorem 1. Consider two connected, compact, twice continuously differentiable submanifolds with- 
out boundary, of same dimension, intersecting at a strictly positive angle, with the intersection set 
having strictly positive reach. Assume the parameters are set so that 

t < rrj/C, r < e/C, e < rj/C, 77 < 1/C, (13) 

and C > is large enough. Then with probability at least 1 — Cnexp [ — nr d r/ 2 jC\ : 

• Algorithm 2 returns exactly two groups such that two points from different clusters are not 
grouped together unless one of them is within distance Cr from the intersection. 

• Algorithm 3 returns at least two groups, and such that two points from different clusters are 
not grouped together unless one of them is within distance Cr from the intersection. 

We note that the constant C > depends on what configuration the surfaces are in, in particular 
their reach and intersection angle, but also aspects that are harder to quantify, like their separation 
away from their intersection. 

We now comment on the challenge of proving a similar result for Algorithm 4. This algorithm 
relies on knowledge of the intrinsic dimension of the surfaces d and the number of clusters (here 
K = 2), but these may be estimated as in (Arias-Castro et al., 2011), at least in theory, so we 
assume these parameters are known. The subsampling done in Step does not pose any problem 
whatsoever, since the centers are well-spread when the points themselves are. The difficulty resides 
in the application of the spectral graph partitioning, Algorithm 1. If we were to include the 
intersection-removal step (Step 3 of Algorithm 2) before applying spectral graph partitioning, then 
a simple adaptation of arguments in (Arias-Castro, 2011) would suffice. The real difficulty, and 
potential pitfall of the method in this framework (without the intersection- removal step), is that 
the points near the intersection may form their own cluster. For example, in the simplest case of 
two affine surfaces intersecting at a positive angle and no sampling noise, the projection matrix 
at a point near the intersection — meaning a point whose r-ball contains a substantial piece of 
both surfaces — would be the projection matrix onto S\ + S2 seen as a linear subspace. We were 
not able to discard this possibility, although we do not observe this happening in practice. A 
possible remedy is to constrain the K-means part to only return large-enough clusters. However, 
a proper analysis of this would require a substantial amount of additional work and we did not 
engage seriously in this pursuit. 
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4 Numerical Experiments 



We tried our code§ on a few artificial examples. Very few algorithms were designed to work in 
the general situation we consider here and we did not compare our method with any other. As 
we argued earlier, the methods of Wang et al. (2011) and Gong et al. (2012) are quite similar to 
ours, and we encourage the reader to also look at the numerical experiments they performed. Our 
numerical experiments should be regarded as a proof of concept, only here to show that our method 
can be implemented and works on some toy examples. 

In all experiments, the number of clusters K and the dimension of the manifolds d are assumed 
known. We choose spatial scale e and the projection scale r/ automatically as follows: we let 

e= max min \\y i — yA\, (14) 

l<i<no jy^i 

and 

r\ = median \\Pi — Pj\\. (15) 

(»>i):[|yj-l/ill<e 

Here, we implicitly assume that the union of all the underlying surfaces forms a connected set. In 
that case, the idea behind choosing e as in (14) is that we want the e-graph on the centers . . . , y n 
to be connected. Then r\ is chosen so that a center y i remains connected in the (e, r/)-graph to most 
of its neighbors in the e-graph. 

The neighborhood radius r is chosen by hand for each situation. Although we do not know 
how to choose r automatically, there are some general ad hoc guidelines. When r is too large, the 
local linear approximation to the underlying surfaces may not hold in neighborhoods of radius r, 
resulting in local PCA becoming inappropriate. When r is too small, there might not be enough 
points in a neighborhood of radius r to accurately estimate the local tangent subspace to a given 
surface at that location, resulting in local PCA becoming inaccurate. From a computational point 
of view, the smaller r, the larger the number of neighborhoods and the heavier the computations, 
particularly at the level of spectral graph partitioning. In our numerical experiments, we find that 
our algorithm is more sensitive to the choice of r when the clustering problem is more difficult. We 
note that automatic choice of tuning parameters remains a challenge in clustering, and machine 
learning at large, especially when no labels are available whatsoever. See (Kaslovsky and Meyer, 
2011; Little et al., 2009; Zelnik-Manor and Perona, 2004; Zhang et al, 2012). 

Since the algorithm is randomized (see Step in Algorithm 4) we repeat each simulation 100 
times and report the median misclustering rate and number of times where the misclustering rate 
is smaller than 5%, 10%, and 15%. 

We first run Algorithm 4 on several artificial data sets, which are demonstrated in the LHS 
of Figures 2 and 3. Table 1 reports the local radius r used for each data set (R is the global 
radius of each data set), and the statistics for misclustering rates. Typical clustering results are 
demonstrated in the RHS of Figures 2 and 3. It is evident that Algorithm 4 performs well in these 
simulations. 

In another simulation, we show the dependence of the success of our algorithm on the intersecting 
angle between curves in Table 2 and Figure 4. Here, we fix two curves intersecting at a point, and 
gradually decrease the intersection angle by rotating one of them while holding the other one fixed. 
The angles are vr/2, tt/4, tt/6 and n/8. From the table we can see that our algorithm performs 
well when the angle is vr/4, but the performance deteriorates as the angle becomes smaller, and the 
algorithm almost always fails when the angle is ir/8. 

§The code is available online at http://www.ima.umn.edu/~zhang620/. 
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Figure 2: Performance of Algorithm 4 on data sets "Three curves" and "Self-intersecting curves". 
Left column is the input data sets, and right column demonstrates the typical clustering. 
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Figure 3: Performance of Algorithm 4 on data sets "Two spheres", "Mobius strips", "Monkey 
saddle" and "Paraboloids". Left column is the input data sets, and right column demonstrates the 
typical clustering. 
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Figure 4: Performance of Algorithm 4 on two curves intersecting at various angles |, |, |, |-. 
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dataset 


r 


median misclustering rate 


5% 


10% 


15% 


Three curves 


0.02 (0.034R) 


4.16% 


76 


89 


89 


Self-intersecting curves 


0.1 (0.017J?) 


1.16% 


85 


85 


86 


Two spheres 


0.2 (0.059.R) 


3.98% 


100 


100 


100 


Mobius strips 


0.1 (0.02&R) 


2.22% 


85 


86 


88 


Monkey saddle 


0.1 (0.069.R) 


9.73% 





67 


97 


Paraboloids 


0.07 (0.048i?) 


10.42% 





12 


91 



Table 1: Choices for r and misclustering statistics for the artificial data sets demonstrated in 
Figures 2 and 3. The statistics are based on 100 repeats and include the median misclustering rate 
and number of repeats where the misclustering rate is smaller than 5%, 10% and 15%. 



Intersecting angle 


r 


median misclustering rate 


5% 


10% 


15% 


vr/2 


0.02 (0.0347-?) 


2.08% 


98 


98 


98 


vr/4 


0.02 (0.034i?) 


3.33% 


92 


94 


94 


vr/6 


0.02 (0.034i?) 


5.53% 


32 


59 


59 


tt/8 


0.02 (0.033.R) 


27.87% 





2 


2 



Table 2: Choices for r and misclustering statistics for the instances of two intersecting curves 
demonstrated in Figure 4. The statistics are based on 100 repeats and include the median misclus- 
tering rate and number of repeats where the misclustering rate is smaller than 5%, 10% and 15%. 



5 Discussion 

We distilled the ideas of Goldberg et al. (2009) and of Kushnir et al. (2006) to cluster points 
sampled near smooth surfaces. The key ingredient is the use of local PCA to learn about the local 
spread and orientation of the data, so as to use that information in an affinity when building a 
neighborhood graph. 

In a typical stylized setting for multi-manifold clustering, we established performance bounds for 
the simple variants described in Algorithm 2 and Algorithm 3, which essentially consist of connect- 
ing points that are close in space and orientation, and then extracting the connected components 
of the resulting graph. Both are shown to resolve general intersections as long as the incidence 
angle is strictly positive and the parameters are carefully chosen. As is commonly the case in such 
analyzes, our setting can be generalized to other sampling schemes, to multiple intersections, to 
some features of the surfaces changing with the sample size, and so on, in the spirit of (Arias- 
Castro, 2011; Arias-Castro et al., 2011; Chen and Lerman, 2009a). We chose to simplify the setup 
as much as possible while retaining the essential features that makes resolving intersecting clusters 
challenging. The resulting arguments are nevertheless rich enough to satisfy the mathematically 
thirsty reader. 

We implemented a spectral version of Algorithm 3, described in Algorithm 4, that assumes 
the intrinsic dimensionality and the number of clusters are known. The resulting approach is very 
similar to what is offered by Wang et al. (2011) and Gong et al. (2012), although it was developed 
independently of these works. Algorithm 4 is shown to perform well in some simulated experiments, 
although it is somewhat sensitive to the choice of parameters. This is the case of all other methods 
for multi-manifold clustering we know of and choosing the parameters automatically remains an 
open challenge in the field. 
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6 Proofs 



We start with some additional notation. The ambient space is R unless noted otherwise. For a 
vector v £ M. D , \\v\\ denotes its Euclidean norm and for a real matrix M £ ]R- Dx - D , ||iW|| denotes 
the corresponding operator norm. For a point x £ IR 15 and r > 0, B(x,r) denotes the open ball of 
center x and radius r, i.e., B(x,r) = {y £ ]R-° : ||y — :c|| < r}. For a set S and a point x, define 
dist(cc, S) = inf{||a; — y\\ : y £ S}. For two points a, b in the same Euclidean space, b — a denotes 
the vector moving a to b. For a point a and a vector i? in the same Euclidean space, a + v denotes 
the translate of a by v. We identify an affine subspace T with its corresponding linear subspace, 
for example, when saying that a vector belongs to T. 

For two subspaces T and T', of possibly different dimensions, we denote by < $max 

(T, T") < 

7r/2 the largest and by 9 m i n (T,T') the smallest nonzero principal angle between T and T' (Stewart 
and Sun, 1990). When v is a vector and T is a subspace, Z(v,T) := 6 maiX (Rv,T) this is the usual 
definition of the angle between v and T. 

For a subset A C M 13 and positive integer d, vold(A) denotes the d-dimensional Hausdorff 
measure of A, and vo1(j4) is defined as voldi m ( J 4)( j 4)) where dim(j4) is the Hausdorff dimension of 
A. For a Borel set A, let Aa denote the uniform distribution on A. 

For a set S C M. D with reach at least 1/k, and x with dist(a;,S') < 1/k, let Ps(x) denote 
the metric projection of x onto S, that is, the point on S closest to x. Note that, if T is an 
affine subspace, then is the usual orthogonal projection onto T. Let denote the class of 

connected, C 2 and compact d-dimensional submanifolds without boundary embedded in IR 13 , with 
reach at least 1/k. For a submanifold S S M D , let Ts(x) denote the tangent space of S at x £ S 1 . 

We will often identify a linear map with its matrix in the canonical basis. For a symmetric 
(real) matrix M, let f3\(M) > ^{M) > • • • denote its eigenvalues in decreasing order. 

We say that / : ft C R D ->■ M D is C-Lipschitz if ||/(aj) - /(y)[| < C[|aj - y||,Vaj,y £ fi. 

For two reals a and 6, a V & = max(o, 6) and a Ab = min(a,6). Additional notation will be 
introduced as needed. 

6.1 Preliminaries 

This section gathers a number of general results from geometry and probability. We took time to 
package them into standalone lemmas that could be of potential independent interest, particularly 
to researchers working in machine learning and computational geometry. 

6.1.1 Smooth surfaces and their tangent subspaces 

The following result is on approximating a smooth surface near a point by the tangent subspace at 
that point. It is based on (Federer, 1959, Th. 4.18(2)). 

Lemma 1. For S £ Sd(n), and any two points s, s' £ S, 

dist(s',T s (s))<*\\s'-s\\ 2 , (16) 

and when dist(s', Ts(s)) < 

dist(«', T s (a)) < k\\P Ts{s) (s') - s\\ 2 . (17) 
Moreover, for t £ T$(s) such that \\s — t\\ < 7/(16k), 

dist(t,5) < n\\t- s\\ 2 . (18) 
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Proof. Let T be short for T s (s). (Federer, 1959, Th. 4.18(2)) says that 

dist(s' - s,T) < ^\\s' - sf. (19) 

Immediately, we have 

dist(s' -s,T) = \\s' - P T (s')\\ =dist(s',r), 
and (16) comes from that. Based on that and Pythagoras theorem, we have 

dist(s',T) = \\P T (s') - S '|| < -sf = ^(\\Pt(s') - s'f + ||P T ( S ') - S || 2 ), 



so that 



dist( S ',T)(l - |dist( S ',r)) < |||P T (s') - s|| 2 , 



and (17) follows easily from that. For (18), let s' = P T 1 (t), which is well-defined by Lemma 3 
below and belongs to B(s, 1/(2k)). We then apply (17) to get 

dist(i,S) < ||t - s'\\ = dist(s',T) < n\\t - s\\ 2 . 

□ 

We need a bound on the angle between tangent subspaces on a smooth surface as a function 
of the distance between the corresponding points of contact. This could be deduced directly from 
(Niyogi et al., 2008, Prop. 6.2, 6.3), but the resulting bound is much looser — and the underly- 
ing proof much more complicated — than the following, which is again based on (Federer, 1959, 
Th. 4.18(2)). 

Lemma 2. For S G Sd(n), and any s, s' G S, 

max (T s (s),T s (s')) < 2asin^|| S / - S || Al) . (20) 
Proof. By (19) applied twice, we have 

dist(s'- s,T s (s)) Vdist(s- s',T s (s')) < ^||s'-s|| 2 . 

Noting that 

dist(t>, T) = \\v\\ sin Z(u, T), (21) 
for any vector v and any linear subspace T, we get 

sin Z(s' - s,T s (s)) V sinZ(s' - s, T s (s')) < ^\\s' - s\\. 

Noting that the LHS never exceeds 1, and applying the arcsine function — which is increasing — 
on both sides, yields 

Z(s' - s,T s (s)) V Z(s' - s,T s (s')) < asin (^\\s' - s\\ A l) . 

We then use the triangle inequality 

e max (T S (s),T s (s')) < Z(s' - s,T s (s)) + Z(s' - s,T s (s')), 

and conclude. □ 
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Below we state some properties of a projection onto a tangent subspace. A result similar to the 
first part was proved in (Arias-Castro et al., 2011, Lem. 2) based on results in (Niyogi et al., 2008), 
but the arguments are simpler here and the constants are sharper. 

Lemma 3. Take S E Sd(n), s £ S and r < and let T be short for Ts(s). Pt is injective on 
B(s,r) n S and its image contains B(s,r') n T, where r' := (1 — |(Kr) 2 )r. Moreover, P^ 1 has 
Lipschitz constant bounded by 1 + ||(Kr) 2 over B(s,r) n T, for any r < 

Proof. Take s',s" E S distinct such that Pt(s') = Pr(s"). Equivalently, s" — s' is perpendicular 
to T s (s). Let T be short for T s (s'). By (19) and (21), we have 

Z(s"-8 , ,T') <asin(|||s"-s / || Al) , 

and by (20), 

max {T,T') < 2 asin falls' Al) . 
Now, by the triangle inequality, 

z( s " - s'X) > z(s" - s ',t) - max (T,r') = | - max (r,T'), 

so that 

asin (~||a" - s'\\ A lj > ~ - 2 asin (~||a' - s\\ A l) . 

When || a' — s\\ < 1/k, the RHS is bounded from below by ir/2 — 2asin(l/2), which then implies that 
fHa" - a' || > sin(7r/2 — 2asin(l/2)) = 1/2, that is, ||s" - a'|| > 1/k. This precludes the situation 
where s' , s" E B(s, l/(2/c)), so that Pt is injective on B(s,r) when r < 1/(2k). 

The same arguments imply that Pt is an open map on R := B(s,r) OS. In particular, Pt{R) 
contains an open ball in T centered at s and Pt(OR) = dPr(R), with dR = S Pi dB(s,r) since 
dS = 0. Now take any ray out of s within T, which is necessarily of the form s + Mv, where v is a 
unit vector in T. Let t a = s + av E T for a E [0, oo). Let a* be the infimum over all a > such that 
t a E Pt(R)- Note that a* > and £ 0>f E Pr(dR), so that there is E 9-R such that Pt(s*) = t a „. 
Let s(a) = P 7 7 1 (s + av), which is well-defined on [0, a*] by definition of a* and the fact that Pt is 
injective on R. We have that s(a) = Dt a P^ 1 v is the unique vector in T a := Ts(P^f 1 (t a )) such that 
Pr(s(a)) = v. Elementary geometry shows that 

||Pr(a(a))|| = \\s(a)\\ cos Z(a(a),T) > ||s(a)|| cos max (T a , T), 

with 

by (20), ||s(a) — a|| < r and cos[2 asin(a?)] = 1 — 2x 2 when < x < l. Since ||Pr(a(a))|| = ||w|| = 1, 
we have ||a(a)|| < 1/C, and this holds for all a < a*. So we can extend s(a) to [0, a*] into a Lipschitz 
function with constant 1/C- Together with the fact that s* E dB(s,r), this implies that 

r = Us* - a|| = ||a(a*) — s(0)|| < -||a*t>|| = 

Hence, a* > C r an d therefore Pt(R) contains B(s, £r) n T as stated. 

For the last part, fix r < j^k, so there is a unique h < 1/(2k) such that ^/t = r, where C is 
redefined as C := 1 - ^(^/i) 2 . Take t' E S(a,r) n T and let a' = Pf x (t') and T' = Is(a'). We 
saw that P^ 1 is Lipschitz with constant 1/C on any ray from s of length r, so that ||a' — a|| < 



cos 6» max (T a ,T) > cos 



2 asm [ — ||a(a) 
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(l/£)||t' — s\\ < r/Q = h. The differential of Pt at s' is Pt itself, seen as a linear map between T' 
and T. Then for any vector u G T', we have 

||P T (w)|| = ||tt||cosZ(«,T) > ||w||cos0 max (T',T), 

with 

cosO max {T',T)>cos 2asin Q||s' - s||) > 1 - ^{nhf = £ 

as before. Hence, ||D t /Py 1 || < and we proved this for all £' G B(s,r) n T. Since that set is 
convex, we can apply Taylor's theorem and get that P^" 1 is Lipschitz on that set with constant 
We then have 

P.A 

l/Q<l + {Khf<l + -{KT)\ 

because nh < 1/2 and r = (h > 7h/8. □ 
6.1.2 Volumes and uniform distributions 

Below is a result that quantifies how much the volume of a set changes when applying a Lipschitz 
map. This is well-known in measure theory and we only provide a proof for completeness. 

Lemma 4. Suppose $7 is a measurable subset ofM. D and f : $7 C M. D — > M. D is C -Lipschitz. Then 
for any measurable set A C O and real d > 0, voLj(/(-A)) < C d vold(A) . 

Proof. By definition, 

vol d (A) = Jim Vl{A), V*(A) := g £ diam^)", 

1 iGN 

where TZ l (A) is the class of countable sequences (Ri-.ie N) of subsets of R D such that A C |J i Pj 
and diam(Pj) < t for all i. Since / is C-Lipschitz, diam(/(P)) < C diam(P) for any R C 0,. Hence, 
for any (Pi) G ft* (A), (/(Pi)) G K Ct (f(A)). This implies that 

< £diam(/(P i )) d < C d £diam(P,) d . 

ieN ieN 

Taking the infimum over (Pi) G ft* (A), we get Vp(f(A)) < C d V%(A), and we conclude by taking 
the limit as t ->• 0, noticing that ^(/(A)) ->• vol(/(A)). □ 

We compare below two uniform distributions. For two Borel probability measures P and Q on 
M. D , TV(P, Q) denotes their total variation distance, meaning, 

TV(P,Q) = sup{|P(A) - Q(A)| : A Borel set}. 

Remember that for a Borel set A, denotes the uniform distribution on A. 

Lemma 5. Suppose A and B are two Borel subsets ofM. D . Then 

TV(A^,A B )<4 vol( ^ UjB) . 
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Proof. If A and B are not of same dimension, say dim(^4) > dim(i?), then TY(X A , Xb) = 1 since 
X A (B) = while \b{B) = 1. And we also have 

vo\(A A B) = vo\ dim{A) (AAB) = vol dim{A) (A) = vol(A), 

and 

vol(A UB) = vol dim{A) (A UB)= vol dim[A) (A) = vo\(A), 

in both cases because voLj im (^) (B) = 0. So the result works in that case. 

Therefore assume that A and B are of same dimension. Assume WLOG that vol(vl) > vol(£?). 
For any Borel set U, 

\a(U) - X B (U) = 

so that 

i a ,m \ , rT M ^ | volant/) -vol(BnC7)| 1 1 

vol(A) 



vol(A) vol(B) 



< vol(A AB) + vol(J3 n U) | vol(A) - vol(B) \ 



< 



vo\(A) vol(B) vol(A) 

2vol(A A B) 



vo\(A) ' 

and we conclude with the fact that vol(^4 UB)< vol(A) + vol (2?) < 2 vol(A). □ 

We now look at the projection of the uniform distribution on a neighborhood of a surface onto 
a tangent subspace. For a Borel probability measure P and measurable function / : M. D — > WL D , P* 
denotes the push-forward (Borel) measure defined by Pf (A) = P(f^ 1 (A)). 

Lemma 6. Suppose A C R"° is Borel and f : A — > R is invertible on f(A), and that both f and 
/ _1 are C-Lipschitz. Then 

TY(X f A ,X nA) )<8(C di ^-l). 

Proof. First, note that A and f(A) are both of same dimension, and that C > 1 necessarily. Let d 
be short for dim(A). Take U C f(A) Borel and let V = / -1 (17). Then 

/ vol(AnV) vol(/(A) n U) 

Xa{U) = vol(A) ' A ^ (C/) = vol(/(A)) ' 

/, m x ,m,/ |vol(An^)-vol(/(A)ntr)| , |vol(A)-vol(/(^))| 



\K(u) - \ fi A)(u)\ < > ~i7T\ — - + 

ertibk 
Lemma 4, we get 



vol(A) vol(A) 
/ being invertible, we have /(iflV) = /(A) n U and / _1 (/(yl) n U) = An V. Therefore, applying 



d vol(/(A) n £7) d 

- voi(^ny) - ' 

so that 

| vol(4 n V) - vol(/(A) n U)\ < {C d - 1) vol(A n V) < (C d - 1) vol(A). 

Similarly, 

| vol(A) - vol(/(^))| < (C d - 1) vol(A). 
We then conclude with Lemma 5. □ 
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Now comes a technical result on the intersection of a smooth surface and a ball. 



Lemma 7. There is a constant Cj > 3 depending only on d such that the following is true. Take 
S G Scl{k), r < and x G M. D such that dist(x, S) < r. Let s = Ps(x) and T = Tg(s). Then 

vol(P T (Sr\B(x,r))A(Tr\B(x,r))) < C 7 (||a; - s|| + r 2 ) vol(T n B(x, r)). 

Proof. Let A r = B(s,r), B r = B(x,r) and g = P T for short. Note that T n B r = T n 
where ro := (r 2 — (5 2 ) 1 / 2 and 5 := ||a? — s\\. Take si G S D B r such that (/(si) is farthest from 
s, so that (7(5 n B r ) C A ri where r\ := ||s — <?(si)|| - note that r\ < r. Let l\ = \\si — g(si)|| 
and ^ be the orthogonal projection of si onto the line By Pythagoras theorem, we have 

\\x — s\ || 2 = || a; — 2/! || 2 + \\yi — s\ II 2 . We have ||a; — s%\\ < r and \\y 1 — si|| = ||s — <7(si)|| = r%. And 
because l\ < nr 2 < r by (17), either is between x and s, in which case \\x — y 1 \\ = 6 — £%, or s is 
between a; and y l5 in which case 1 1 sc — 3/i 1 1 = S + £\. In any case, r 2 > r\ + (5 — li) 2 , which together 
with^i < nr 2 implies r\ <r 2 — 5 2 + 28£\ < t-q + 2Kr 2 e), leading to r\ < (1 — 2K<5) _1 / 2 ro < (l + 4«5)ro 
after noticing that 5 < r < 1/(3k). From g(S n B r ) C T n A n , we get 



vol(s(5nfl r )\(Tn.B r )) < vol(T n A ri ) — vol(T n A 



((ri/r ) d - l)vol(Tn A. 



ro, 
ro)- 



We follow similar arguments to get a sort of reverse relationship. Take S2 G S D B r such 
that g(S n B r ) D T n A- 2 , where r2 := ||s — 5(^2) || is largest. Assuming r is small enough, by 
Lemma 3, g~ l is well-defined on T n ^4 r , so that necessarily S2 G <9-B r . Let £2 = \\ s 2 — 5(^2)11 
and y 2 be the orthogonal projection of S2 onto the line (x,s). By Pythagoras theorem, we have 
\\x — S2W 2 = \\x — y 2 \\ 2 + \\y 2 — 1 1 2 - We have ||a; — S2II = v and ||y 2 — S2 1| = ||s — 5(^2) || = t%. 
And by the triangle inequality, \\x — y 2 \\ < \\ x ~ s ll + II2/2 — s ll = <5 + ^2- Hence, r 2 < r 2 + (6 + i^) 2 , 
which together with ^ 2 < fWj by (17), implies rf > r 2 - <5 2 - 2<% — £\>r\ — (25 + «;r 2 )Kr 2 , leading 
to r 2 > (1 + 2k5 + K 2 r 2 ) _1 / 2 r > (1 - 2k5 - K 2 r 2 )r . From g(S D B r ) D T D A r2 , we get 

vol((mB r ) \g(SDB r )) < vol(Tn A ro ) - vol(Tfl A r2 ) 

= (l-(r 2 /r )Vol(TnA ro ). 

All together, we have 

vol( 5 (5n5 r )A(TnB r )) < ((n/r ) d - (ra/ro^) vol(TnA ro ) 

< ((1 + AK5) d - (1 - 2k<5 - K 2 r 2 ) d ) vol((T n 5 r )), 

with (1 + Anr) d — (1 — 4Kr) d < C{5 + r 2 ) when 5 < r < 1/(3k), for a constant C depending only 
on d and k. The result follows from this. □ 

We bound below the d-volume of a the intersection of a ball with a smooth surface. Though it 
could be obtained as a special case of Lemma 7, we provide a direct proof because this result is at 
the cornerstone of many results in the literature on sampling points uniformly on a smooth surface. 

Lemma 8. Suppose S G Sd(K). Then for any s G S and r < rg^gj^ , we have 

vol(SnB(s,r)) 
1 - 2dnr < — „) ' {{ < 1 + 2ckr, 
~ vol(TnB(s,r)) ~ 

where T := Ts(s) is the tangent subspace of S at s. 
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Proof. Let T = Tg(s), B r = B(s,r) and g = Pt for short. By Lemma 3, g is bi-Lipschitz with 
constants (1 + nr)~ l and 1 on S Pi B r , so by Lemma 4 we have 

voI(,( S n|)) al 

vo1(d Pi ±f r ) 

That is Lipschitz with constant 1 + nr on g(SnB r ) also implies that g(S(lB r ) contains TnB r / 
where r' := r/(l + nr). From this, and the fact that g(S n B r ) cTfl -B r , we get 

voi(rn£ r ) voi(Tni? r ) _ _ d 
- vol(<7(5 n 5,)) " vol(r n b^) ~ '*- [L + Kr} ■ 



We therefore have 



vol(5nS r ) > vol(s(S n B r )) > (1 + Kr)- d vol(TnB r ), 



and 

vol(SnB r ) < (1 + Kr) d vol(g{S H B r )) < (1 + Kr) d vol(T n B r ). 
And we conclude with the inequality (1 + x) d < 1 + 2dx valid for any x G [0, 1/cZ] and any c? > 1. □ 

We now look at the density of a sample from the uniform on a smooth, compact surface. 

Lemma 9. There is a constant Cg > such that the following is true. If S G Sd(n) and we sample 
n points S\, . . . , s n independently and uniformly at random from S, and if < r < 1/(Cqk), then 
with probability at least 1 — Cgr~ d exp(— nr d /Cg), any ball of radius r with center on S has between 
nr d /Cg and Cgnr d sample points. 

Proof. For a set R, let N(R) denote the number of sample points in R. For any R measurable, 
N(R) ~ Bm(n,pn), where pr := vol(R n S)/vol(S). Let xi, . . . ,x m be an (r/2)-packing of S, 
and let Bi = B(xj,r/4) n S. For any s G S, there is j such that \\s — Xj\\ < r/2, which implies 
Bi C B(s,r) by the triangle inequality. Hence, min sg s N(B(s, r)) > minjA^(Bj). 
By the fact that 2?j n Bj = for j / j, 



K5 1 ) > Vvol(Si) > mminvol(5i), 



vo 

1=1 



and assuming that r is small enough that we may apply Lemma 8, we have 

minvol(Si) > — (r/4) d , 
i 2 

where is the volume of the ci-dimensional unit ball. This leads torn < Cr~ d and p := minj pB t > 
r d j C, where C > depends only on S. 

Now, applying Bernstein's inequality to the binomial distribution, we get 

P (N(Bi) < np/2) < F (N(Bi) < np B ./2) < e ~(3/32)n PBi < g -(3/32)n P> ^ 23 ) 
We follow this with the union bound, to get 

minN(B(s,r)) < nr d /(2C)] < m e - (3/32)np < Cr- d e~^c nrd . 
seS J 

From this the lower bound follows. The proof of the upper bound is similar. □ 
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Next, we bound the volume of the symmetric difference between two balls. 
Lemma 10. Take x, y G R d and < S < 1. Then 

vol(B(x,S)AB(y,l)) 



2vol(£(0,l)) 



<l-(l-\\x-y\\) d + A6 a 



Proof. It suffices to prove the result when \\x — y\\ < 1. In that case, with 7 := (1 — \\x — y\\) Ad, 
we have B(x, 7) C B(x, 5) n B(y, 1), so that 

vol(B(aj, 6) A B(y , 1)) = vol(S(aj, 5)) + vo\{B(y, 1)) - 2 vol(S(aj, i)n%l)) 

< 2vol(B(y,l))-2vol(B(a!,7)) 
= 2vol( J B(y,f))(l- 7 d ). 

□ 

6.1.3 Covariances 

The result below describes explicitly the covariance matrix of the uniform distribution over the 
unit ball of a subspace. 

Lemma 11. Let T be a subspace of dimension d. Then the covariance matrix of the uniform 
distribution on T n B(0, 1) (seen as a linear map) is equal to cPt, where c := ^5 • 

Proof. Assume WLOG that T = M. d x {0}. Let X be distributed according to the uniform distri- 
bution on T n B(0, 1) and let R = \\X\\. Note that 

By symmetry, K(XiXj) = if i 7^ j, while 

E(Xf) = I E(Xf + -..+X 2 d ) = - E{R 2 ) = - [ r 2 ■ dr^dr 
d d d Jq 



d + 2 

This is exactly the representation of xro-Pr in the canonical basis of M. D . □ 



We now show that a bound on the total variation distance between two compactly supported 
distributions implies a bound on the difference between their covariance matrices. For a measure 
P on R D and an integrable function /, let P(f) denote the integral of / with respect to P, that is, 



P(f) = J f(x)P(dx), 



and let E(P) = P(x) and Cov(P) = P(xx T ) - P(x)P(x) T denote the mean and covariance matrix 
of P, respectively. 

Lemma 12. Suppose A and v are two Borel probability measures on M. d supported on B(0, 1). Then 
-E(v)\\ < VdT\(X, v), ||Cov(A)-Cov(i/)|| < 3dTV(A, z/). 
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Proof. Let fk(t) = tfc when t = (ti, . . . ,td), and note that |/fe(t)| < 1 for all k and all t G B(0, 1). 
By the fact that 

TV(A, v) = sup{A(/) - v{f) : f : R d -»• R measurable with |/| < 1}, 

we have 

\\(f k )-Hfk)\<TV(\,v), Vfc = l,...,d. 

Therefore, 

d 

|| E(A) - E(^)|| 2 = £(A(/ fc ) - KA)) 2 < dTV(A, v)\ 
h=\ 

which proves the first part. 

Similarly, let fki{t) = t^tl- Since |/h(*)| < 1 for all /c,^ and all t G 5(0, 1), we have 

|A(/ M ) <TV(A, i/), Vfe,^=l,...,d. 

Since for any probability measure /j on K" 1 , 

Cov(/i) = (m(//«) - Kfk)Kfi) : k,£ = l,...,d), 

we have 

|| Cov(A) - CovMH < d max (|A(/ H ) - + |A(/ fc )A(//) - v(fk)"(fe)\) 

< dmax (\X(f ke ) - u(f ki )\ + \\{f k )\\\(f e ) - v{fi)\ + Hh)\\\(f k ) - v(f k )\) 

< 3dTV(\,v), 

using the fact that |A(/fc)| < 1 and \f{fk)\ < 1 for all k. □ 

Next we compare the covariance matrix of the uniform distribution on a small piece of smooth 
surface with that of the uniform distribution on the projection of that piece onto a nearby tangent 
subspace. 

Lemma 13. There is a constant C13 > depending only on d such that the following is true. Take 
S G S d (n), r < ^ and x G R D such that dist(x,S) < r. Let s = P s {x) and T = T s {s). If C 
and £ are the means, and M and N are the covariance matrices of the uniform distributions on 
S n B(x, r) and T n B(x, r) respectively, then 

IIC- £11 < Ci 3 Kr 2 , \\M - N\\ < C 13 Kr 3 . 

Proof. We focus on proving the bound on the covariances, and leave the bound on the means - 
whose proof is both similar and simpler — as an exercise to the reader. Let T = Tg(s), B r = B(x, r) 
and g = Pt for short. Let A = S n B r and A' = T n B r . Let X ~ X A and define Y = g(X) with 
distribution denoted A^. We have 

Cov(A:) - Cov(Y) = * ( Cov(X - Y, X + Y) + Cov(X + Y,X-Y)), 

where Cov(U, V) = E((f7 — Pu)(V — /-ty) T ) is the cross-covariance of random vectors U and V with 
respective means ix n and fx v . Note that by Jensen's inequality, the fact ||wi> T || = ||u||||v|| for any 
pair of vectors u, v, and then the Cauchy-Schwarz inequality 

|| Cov(C7, V)\\ < E(\\U -puW- \\V - /xy ||) < E{\\U - ^|| 2 ) 1/2 ■ E(||V - /xyf) 1 / 2 . 
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Hence, letting fi x = EX and n Y = EF, we have 



||Cov(A A )-Cov(A^)|| < \\Cov(X -Y,X + Y)\\ 

< E [\\X -Y-Vx + Vy II 2 ] 1/2 E [\\X + Y-» X - n Y || 2 ] 1/2 

< E[||X-y|| 2 ] 1/2 (E[||X- S || 2 ] 1/2 +E[||y- S || 2 ] 1/2 ) (24) 

< — r [r + r) = nr , 

where the third inequality is due to the triangle inequality and fact that the mean minimizes the 
mean-squared error, and the third to the fact that X,Y £ B r and (16). 

Assume r < 1/((CV V d)/t). Let \ g (A) denote the uniform distribution on g(A). X 9 A and \ g (A) 
are both supported on B r , so that applying Lemma 12 with proper scaling, we get 

|| Cov(A^) - Cov(A g(A) )|| < 3dr 2 TV(\%\ g{A) ). 

We know that g is 1-Lipschitz, and by Lemma 3 — which is applicable since Cj > 3 — g~ l is 
well-defined and is (1 + Kr)-Lipschitz on B r . Hence, by Lemma 6 and the fact that dim(A) = d, 
we have 

TV(A^, X g(A) ) < 8((1 + Kr) d - 1) < IQdnr, 

using the inequality (1 + x) d < 1 + 2dx, valid for any x G [0, 1/d] and any d > 1. 

Noting that A^' is also supported on B r , applying Lemma 12 with proper scaling, we get 

II Cov(X g(A) ) - Cov(X A ,)\\ < 3dr 2 TY(\ g(A) ,\ A ,), 

with 

TV(A 9(A) ,A.,,<4^Ml<C» r , 

by Lemma 5 and Lemma 7, where C depends only on d, k. 
By the triangle inequality, 



\M - N\\ = || Cov(Aa) - Cov(A A 
Cov(Aa) - Cov(A^ 
< Kr 3 + A8d 2 K,r 3 + Cr 3 . 



< || Cov(A A ) - Cov(A«)|| + || Cov(A^) - Cov(A ff(A) )|| + || Cov(X g(A) ) - Cov(A A , 



From this, we conclude. □ 

Next is a lemma on the estimation of a covariance matrix. The result is a simple consequence of 
the matrix Hoeffding inequality of Tropp (2012). Note that simply bounding the operator norm by 
the Frobenius norm, and then applying the classical Hoeffding inequality (Hoeffding, 1963) would 
yield a bound sufficient for our purposes, but this is a good opportunity to use a more recent and 
sophisticated result. 

Lemma 14. Let C m denote the empirical covariance matrix based on an i.i.d. sample of size m 
from a distribution on the unit ball ofW 1 with covariance S. Then 

P(||C m -S|| >t) <4dexp (-^ mm (^>^) J ■ 
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Proof. Without loss of generality, we assume that the distribution has zero mean and is now 
supported on -B(0, 2). Let xi, . . . , x m denote the sample, with Xi = . . . , Xj,d). We have 



where 



Cm — C m XX 

m 



^ m - m 

C m ■ / , x i x i 7 x ■ / , x i 

i=i i=i 



Note that ^ 

||C m — Sll < llci, — Sll H — ll^ll 2 - 

m 

Applying the union bound and then Hoeffding's inequality to each coordinate — which is in [—2, 2] 
we get 

d / 4.2 



P(INI > t) < 5^P(|zj-| > t/Vd) < 2dexp 



v 8d 

j=i 

Noting that ^(x^xf — S),z = l,...,m, are independent, zero-mean, self-adjoint matrices with 
spectral norm bounded by 4/m, we may apply the matrix Hoeffding inequality (Tropp, 2012, 
Th. 1.3), we get 

P (\\C* m - S|| > t) < 2dexp f-j^a) . °" 2 := rn(A/mf = 16/m. 
Applying the union bound and using the previous inequalities, we arrive at 
P(||C m -S|| >t) < P(||C*-£|| >t/2)+p(||jc|| > y/mt/2 



mt 2 \ , ( m 2 t 



< 2d exp — — — - + 2d exp 



512 J r \ I6d 



( m t ft m \\ 
< 4a exp mm — , — 

P V 16 V 32' d > ) 



□ 



6.1.4 Projections 

We relate below the difference of two orthogonal projections with the largest principal angle between 
the corresponding subspaces. 



Lemma 15. For two affine non-null subspaces T, T' , 
\\P T - P T ,\\ = 




(T, T'), if dim(T) = dim(T'), 
otherwise. 

Proof. For two affine subspaces T,T' C M. D of same dimension, let | > 6\ > • • • > dp > 0, denote 
the principal angles between them. By (Stewart and Sun, 1990, Th. 1.5.5), the singular values of 
Pt — Pt' are {sin #7 : j = 1, . . . , g}, so that \\Pt — Pt'W = max,- sin 8j = sin#i = sin 9 max (T, T'). 
Suppose now that T and T' are of different dimension, say dim(T) > dim(T'). We have \\Pt—Pt' || < 
\\Pt\\ V H-Pt'II = 1) since Pt and Pj" are orthogonal projections and therefore positive semidefinite 
with operator norm equal to 1. Let L = Pt(T'). Since dim(L) < dim(T') < dim(T), there is 
ueTn L 1 - with u / 0. Then v T u = P T (v) T u = for all v G T', implying that Pt'(u) = and 
consequently (Pt — Pt>)u = u, so that \\Pt — Pt'W > 1- D 
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The lemma below is a perturbation result for eigenspaces and widely known as the sin G The- 
orem of Davis and Kahan (1970). See also (Luxburg, 2007, Th. 7) or (Stewart and Sun, 1990, 
Th. V.3.6). 

Lemma 16 (Davis and Kahan). Let M be positive semi- definite with eigenvalues fix > 02 > ■ ■ ■ ■ 
Suppose that := (3^ — fid+i > 0. Then for any other positive semi- definite matrix N , 

H p W p (d) l{< V2\\N-M\\ 

11 N M II ^ Ad 

where pjff an d Pm denote the orthogonal projections onto the top d eigenvectors of M and N , 
respectively. 



6.1.5 Intersections 

We start with an elementary result on points near the intersection of two affine subspaces. 

Lemma 17. Take any two linear subspaces Ti,T 2 C MP . For any point t\ G T\ \ T2, we have 

dist(ti,T 2 ) > dist(ti,rinT 2 ) sin 0^(71, T 2 ). 

Proof. We may reduce the problem to the case where T x n T 2 = {0}. Indeed, let f x = T x D T^, 
f 2 = Tj 1 n T 2 and ti = t x - P TinT2 (ti). Then 

-P T2 (ti)|| = \\h - P f2 (ti)\\, ||ti -P Tl nr 2 (ti)|| = ||*l||, sin0 min (Ti,T 2 ) = sin^ min (fi, T 2 ). 

So assume that T\ n T 2 = {0}. By (Afriat, 1957, Th. 10.1), the angle formed by t\ and PT 2 {t\) 
is at least as large as the smallest principal angle between T\ and T 2 , which is ra \ n {Ti,T2) since 
T\ H T 2 = {0}. From this the result follows immediately. □ 

The following result says that a point cannot be close to two compact and smooth surfaces 
intersecting at a positive angle without being close to their intersection. Note that the constant 
there cannot be solely characterized by k, as it also depends on the separation between the surfaces 
away from their intersection. 

Lemma 18. Suppose S%, S 2 £ <Sd(n) intersect at a strictly positive angle and that reach(S*i n S 2 ) > 
1/k. Then there is a constant Cig such that 

dist(£c, Si n S 2 ) < Ci 8 max{dist(s,5i),dist(a;,5 2 )}, S 1 D . (25) 

Proof. Assume the result is not true, so there is a sequence (x n ) C such that dist(a; n , SinS 2 ) > 
nmaxfc dist(a; n , S^). Because the surfaces are bounded, we may assume WLOG that the sequence 
is bounded. Then dist(a3 n , S% n S 2 ) is bounded, which implies max^ dist(a; n , Sk) = 0(l/n). This 
also forces dist(a; n , Si n S 2 ) — > 0. Indeed, otherwise there is a constant C > and a subsequence 
(av) such that dist(av, Si n S 2 ) > C. Since (x n i) is bounded, there is a subsequence (av) that 
converges, and by the fact that max^ dist(a; n », Sk) = o(l), and by compactness of Sk, the limit 
is necessarily in S x D S 2 , which is a contradiction. So we have dist(a: n ,Si n S 2 ) = o(l), implying 
maxfc dist(a: n , Sk) = o(l/n). 

Assume n is large enough that dist(a3 n , Si D S 2 ) < 1/k and let be the projection of x n onto 
Sk, and st, the projection of x n onto S x n S 2 . Let Tk = Ts k (sn) and note that 9 m i n {Ti,T 2 ) > 9, 
where 6 > is the minimum intersection angle between Si and S 2 defined in (10). Let t„ be the 
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projection of 4 onto T^. Assume WLOG that \\t\ — 4|| > ||i 2 — 411- Let t n denote the projection 
of t\ onto T\ n T2, and then let s n = Ps 1 ns 2 (tn)- 
By assumption, we have 



We start with the RHS 



nmax ||a;„ — s n || < \\x n — 4ll = o(l). (26) 



||£C n — s*||= min ||a: n — s\\ < \\x n - s n \\, (27) 
se5ins 2 

and first show that \\x n — s n \\ = o(l) too. We use the triangle inequality multiple times in what 
follows. We have 

H^n S n \\ ^ \\x n ~\~ \\^n t n \\ + \\t n t n \\ + ||t n S n ||. (28) 

From (26), \\x n — 411 = o(l) and \\x n — s n \\ = o(l), and so that by (16), 

114 - ti[| < 4sl - 4f < 2k(||4 - ag| 2 + \\x n - sif) = o(l). (29) 

We also have 

Il4 ~ t n \\ = min H4 - t|| < ||4 - 4ll < 114 - 4ll + 114 - x n\\ + \\x n - 4ll = o(l), (30) 

where the first inequality comes from s n G T\ n T2 . Finally, 

||*n - s n || = min ||t„ - s|| < ||t„ - 4ll < \\tn - 4ll + Il4 _ 4ll = °(1)> 
se5in5 2 

where the first inequality comes from 4 £ Si PI S^- 

We now proceed. The last upper bound is rather crude. Indeed, we use (18) for 5 = Si n S2 
and s = 4, noting that Ts in s 2 (4) = T\ (IT2 and ||t n — 4|| = o(l), and get 

||tn ^n|| ^H^ra ^nll — ^(ll^n ^n|| \\^n ""nil H^n 4ll) • 

We have ||x n — 4|| = \\x n — Ps 1 nS 2 ( x n)\\ < \\x n — s n|| because s n € T\ n T2. This leads to 

||*n - »n II < «(||*n - s n || + 2\\s n - a:„||) 2 < \K\x n - s n || 2 , (31) 

eventually, since \\t n — s n \\ = o(l). 

Combining (28), (29) and (31), we get 

||«"n Sn|| ^ \\x n S n || + 0(||iC n S n || + \\x n &n\\ ) \\t n t n || + 0(||a? n ^n|| )j 

which leads to 

||aJn - Sn|| < 2||aj„ - 411 + 2||4 - M> (32) 
when n is large enough. Using this bound in (26) combined with (27), we get 

ill n ~ 2 k ii 

1 1 t n ~ tn 1 1 ^ n max | \x n — s n \\. 

Z k 
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We then have 

II k \\ ^_ ^ II 1 2 II 

max \\x n s n || > —\\s n — s n \\ 

k z 

— 2^^ n _ ~~ II s ™ — *nll — ll s n — *nll) 

> idist(4,r 2 )- 

with 

114 - <ll = 0{\\x n - 4f + ||aj n - 4f) = 0(Hsb„ - s n \\ 2 ) = 0{\\t l n - t n \\ 2 ), 

due (in the same order) to (29), (26)-(27), and (32). Recalling that [|4 — t n \\ = dist(t^,Ti P1T2), 
we conclude that 

dist(4, T 2 ) = Ofl/n) dist(4, ?i n T 2 ) + 0(1) dist(4, 2i n T 2 ) 2 . 

However, by Lemma 17, dist(4i I2) > (sin 0) dist(4) H T 2 ), so that dividing by dist(4> I2) above 
leads to 1 = 0(l/n) + 0(1) dist(4,T 2 ), which is in contradiction with the fact that dist(t^,T 2 ) < 
||4 — t n || = o(l), established in (30). □ 

6.1.6 Covariances near an intersection 

We look at covariance matrices near an intersection. We start with a continuity result. 

Lemma 19. Let T\ and T 2 be two linear subspaces of same dimension d. For x E T\, denote 
by 5j(£c) the covariance matrix of the uniform distribution over B(x, 1) n (T\ U T 2 ). Then, for all 
x,y e Ti, 

5d \\x — y\\, if d> 2, 
^6\\x-y\\, ifd=\. 



\V(x)--£(y)\\ < 



Proof. Since, by Lemma 11, S(as) = cP^ for all x £ T\ such that dist(a5, T 2 ) > 1, we may assume 
that dist(sr,ri) < 1 and dist(y,Ti) < 1. Let d = dim(Ti) = dim(T 2 ) and A j x = B(x, 1) n Tj for 
any x and j = 1,2. By Lemma 12 and then Lemma 5, we have 



A h uA l' 



< 4 



|E(aO-E(t/)|| = ||Cov(A^ uA| )-Cov(A 
< TV(A A i uA |, A A i uA 2) 

| vol((^U^)A(4u4)) 
vol((^iU^)U(^iU^)) 
vol(^A4) + vol(^A4) 
vol(Al) 

Note that A^ is the unit-radius ball of T\ centered at x, while A 2 , is the ball of T 2 centered 
at X2 := Pt 2 (x) and of radius n := y/l — \\x — a: 2 || 2 . Similarly, Ay is the unit-radius ball of T\ 
centered at y, while A y is the ball of T 2 centered at y 2 := Pt 2 iv) an d of radius 5 := \J\ — ffj/ — y 2 || 2 . 
Therefore, applying Lemma 10, we get 

voK^A^)^ ^ 



2vol(,4i 



<!-(!-*)+, 
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and assuming WLOG that 5 < rj, and after proper scaling, we get 

vol(^A4) 



2vol(,4i 



<(:=V-(ri-hy + A6< 1 , 



where £ := \\x — y\\ and £2 := ||aj2 — 1/2 1 1 ~~ note that £2 < t by the fact that Pt 2 is 1-Lipschitz. 

We have 1 — (1 — t)j£ < dt. This is obvious when £ > 1, while when £ < 1 it is obtained using 
the fact that, for any < s < £ < 1, 

^ - s rf = (£ - s )(£ d -! + s £ d ~ 2 + • • • + <rt + s «*-i) < d£ d - 1 (£ - s) < d{t - s). (33) 
For the second ratio, we consider several cases. 

• When 77 < £21 then £ = 77 < 77 < £2 < £. 

• When £ 2 < 77 < £ 2 + 5, then ( = rf - (77 - £2)^ < d£ 2 < d£- 

• When 77 > £2 + 5 and d > 2, we have 

£ = 77^ - 5 d < ^77(77 - S) < d(n 2 - 5 2 ) 

= (£(|| 2 /-y 2 || 2 -||^-^2|| 2 ) 

= d(\\y - 2/2 II + ||sc - sb 2 ||)(||i/ - y 2 ll - II® - asa[|) 

< 2d(£ + £ 2 ) < 4dt, 

where the triangle inequality was applied in the last inequality, in the form of 

\\y - y 2 \\ < \\y - x\\ + \\x - x 2 \\ + \\x 2 - x\\ = \\x - x-i\\ + t + t 2 . 

• When 77 > £2 + 5 and d = 1, we have 



C = 7]-5 < y/\\y - y 2 \\ - \\x - x 2 \\ < y/t + t 2 < V2t, 
using the same triangle inequality and the fact that, for any < s < t < 1, 



< ^r s - ^i—t = , t s , : < - t s < 4=^ = Vt^~s. 

y/T=S + y/T=t y/1 -t + t-S 

When d > 2, we can therefore bound ||S(x) — S(y)|| by dt + 4dt = 5dt, and when d= 1, we bound 
that by £ + V2t < V6t- □ 

The following is in some sense a converse to Lemma 19, in that we lower-bound the distance 
between covariance matrices near an intersection of linear subspaces. Note that the covariance 
matrix does not change when moving parallel to the intersection; however, it does when moving 
perpendicular to the intersection. 

Lemma 20. Let T% and T 2 be two linear subspaces of same dimension with m [ n (Ti,T 2 ) >9o>0. 
Fix a unit norm vector v £ T\ Pi (T\ Pi T^^. With X(/Vy) denoting the covariance of the uniform 
distribution over B(hv, 1) PI (Ti U T2), we have 

inf sup ||S(/w>) - S(£t>)|| > I/C20, 

h e 

where the infimum is over < h < l/sin#o and the supremum over max(0, h — 1/2) < t < 
min(l/ sin^o, h + 1/2), and C 2 q > depends only on d and 6q. 



30 



Proof. If the statement of the lemma is not true, there are subspaces Ti and T2 of same dimension 
d, a unit length vector v G Ti n (Ti n T2) 1 - and < h < 1/ sin#o, such that 

H(£v) = H(hv) for all max(0, h - 1/2) < £ < min(l/sm0 Oj ft + 1/2). (34) 

By projecting onto (Ti n T^) -1 -, we may assume that T\ n T2 = without loss of generality. Let 
6* = Z(v,T2) and note that 9 > 9q since Ti n T2 = 0. Define u = (v — Pt 2 v )I sin^ and also 
w = Pt 2 v/cosO when 9 < tt/2, and w G T2 is any vector perpendicular to w when 9 = ir/2. 
B(hv, 1) fl Ti is the d-dimensional ball of Ti of radius 1 and center /it;, while — using Pythagoras 
theorem — B(hv, 1) n T2 is the ci-dimensional ball of T2 of radius t := (1 — (/isin#) 2 ) 1//2 and center 
(h cos 9)w. Let X be drawn from the uniform distribution over B(hv,l) n (Ti U T2), while Xq 
and Xq are independently drawn from the uniform distributions over the unit balls of Ti and T2, 
respectively. By Lemma 11, Cov(Xo) = cPt x and Cov(X ) = cPt 2 where c := l/(d+ 2). Also, let 
^ be Bernoulli with parameter a, where 

vol(B(hv, 1) n Ti) _ vol(B(hv, 1) D Ti) _ 1 

" "~ vol(S(/if,l)n(TiUT 2 )) ~ vol(5(/iw, 1) nTi) + vo\(B((h cos 9) w,t) nT 2 ) ~ 1 + t d ' 

We have 

X~^(/iu + X ) + (l-^)((/icos^)w + iX ). 
A straightforward calculation, or an application of the law of total covariance, leads to 

Cov(X) = E(f) Cov(Xo) + E(l - t2 Cov(X ) + Vai(£)h 2 (v - (cos 9)w)(v - (cos 9)w) T , (35) 

which simplifies to 

H(hv) = caP Tl + c(l - a)t 2 P T2 + a(l - a)(l - t 2 )uu T , 

using the fact that i; — (cos#)k; = (sin 0)it and the definition of t. Let 9\ = 9 may[ (Ti, T2) and let 
t>i G Ti be of unit length and such that /_(v\,T2) = 0\. Then for any < h,£< 1/ sin#0) we have 

\\V(hv) - H(£v)\\ > \vJV(hv) Vl - vJV(£v) Vl \ = \f(t h ) - f(t e )\, (36) 

where th ■= (1 — (/isintf) 2 ) 1 / 2 and 

c ct d + 2 (cos 9 X ) 2 ^(l-t^i^Vj) 2 

W) - i + T d + J\ + ¥f ' 

It is easy to see that the interval 

I h = {tt : (h - 1/2)+ <£<(!/ sin 9 ) A (h + 1/2)} 

is non empty. Because of (34) and (36), f(t) is constant over t G Ih, but this is not possible since 
/ is a rational function not equal to a constant and therefore cannot be constant over an interval 
of positive length. □ 

We now look at the eigenvalues of the covariance matrix. 

Lemma 21. Let T\ and T2 be two linear subspaces of same dimension d. For s £ Tj, denote 
by 5j(£c) the covariance matrix of the uniform distribution over B(x, 1) n (Ti U T2). Then, for all 
x G Ti, 

c (l _ (i _ 5 2 (x)) d l 2 ) < p d (E(x)), £1 (£(*)) < c + 5(x)(l - 8 2 (x)) d / 2 , (37) 
^(l-cos9 max (T 1 ,T 2 )) 2 (l-5 2 (x)) d f +1 <p d+1 (V(x)) < (c + 5 2 (x))(l-5 2 (x)) d / 2 , (38) 

o 

where c := l/(d + 2) and 5(x) := dist(a;, T2). 
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Proof. As in (35), we have 

= acP Tl + (1 - a)ct 2 P T2 + a(l - a)(x - x 2 )(x - x 2 ) T , (39) 

where ai2 := Pt 2 {x) and a := (1 + i rf ) -1 with i := (1 — 8 2 (x))]/ 2 . Because all the matrices in this 
display are positive semidefinite, we have 

p d (E(x)) > ac\\P Tl \\ = ac, 

with a > 1 — t d . And because of the triangle inequality, we have 

/Si(S(aj)) < ac||P Tl || + (1 - a)ct 2 \\P T2 \\ + a(l - a)[|ae - x 2 \\ 2 < c + a(l - a)5 2 (x), 

with a(l — a) < t . Hence, (37) is proved. 

For the upper bound in (38), by Weyl's inequality (Stewart and Sun, 1990, Cor. IV.4.9) and 
the fact that /3d+i(-Pri) = 0, and then the triangle inequality, we get 

(3 d+1 {V(x)) < \\V(x) - acP Tl \\ 

< c(l-a)t 2 \\P T2 \\+a(l-a)5 2 (x) 

< (l-a)(c + 5 2 (x)), 

and we then use the fact that 1 — a < t . For the lower bound, let 9\ > 9 2 > • • • > 6$ denote 
the principal angles between T\ and T 2 . By definition of principal angles, there are orthonormal 
bases for T\ and T 2 , denoted v\, . . . ,v d and w\, . . . ,w d , such that vjw^ = TLj=k • cos6j. Take 
u E span(ui, . . . , v d , w±), that is, of the form u = av\ + v + bw±, with v £ span(v2, • • • , vj). Since 
Ptx = VivJ + • • • + VjvJ and Pt 2 = wiwj + • • • + w^wj, we have 

-ii T S(a;)it > a(a 2 + \\v\\ 2 + 2abcos6i + 6 2 cos 2 6»i) + (1 - a)t 2 (b 2 + 2abcos8 1 + a 2 cos 2 6 \) 
c 

= a(a + bcos9i) 2 + (1 - a)t 2 (acos#i + b) 2 + a(l - a 2 - b 2 ), 

assuming \\u\\ 2 = a 2 + \\v\\ 2 + b 2 = 1. If \a\ V |6| < 1/2, then the RHS > a/2 > 1/4. Otherwise, 
the RHS > (1 — a)t 2 (l - cos6'i) 2 /4, using the fact that a>l — a>(l — a)t 2 . Hence, by the 
Courant-Fischer theorem (Stewart and Sun, 1990, Cor. IV. 4. 7), we have 

p d+ i{T,(x)) > |(1 - a)t 2 (l - cosfli) 2 , 

with 1 - a > t d /2. This proves (38). □ 

Below is a technical result on the covariance matrix of the uniform distribution on the intersec- 
tion of a ball and the union of two smooth surfaces, near where the surfaces intersect. It generalizes 
Lemma 13. 

Lemma 22. Let Si,S 2 G 5^(k) intersecting at a positive angle, with reach(Si n S 2 ) >1/k. Then 
there is a constant C 22 > 3 such that the following holds. Fix r < 1/C 22 , and for s G Si with 
dist(s,S 2 ) < r, let C(s) and E(s) denote the covariance matrices of the uniform distributions over 
B(s,r) n (Si U S 2 ) and B(s,r) n (Tj U T 2 ), where T x := T Sl (s) and T 2 := T S2 (Ps 2 (s)). Then 

\\C(s)-V(s)\\<C 22 r 3 . (40) 
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Proof. Below C denotes a positive constant depending only on S\ and S 2 that increases with each 
appearance. We note that it is enough to prove the result when r is small enough. Take s 6 5i 
such that 5 := dist(s, S 2 ) < r and let s 2 = Ps 2 ( s ) — n °te that \\s — s 2 \\ = 8. Let B r be short for 
B(s,r) and define A k = B r n Sfc, = E(A J 4 fc ) and Dfc = Cov(A J 4 fc ), for k = 1,2. As in (35), we 
have 

C(s) = aDi + (1 - a)-D 2 + a(l - a)(/x 1 - /u 2 )(A*i ~ J*2) T > 

where 

vol(Ai) 



a := 



vol(Ai) +vol(A 2 )' 

Let Ti = T 5l (s) and T 2 = T S2 (s 2 ), and define AJ. = B r n T fc , so that £ r n (Ti U T 2 ) = A[ U A' 2 . 
Note that E(A^' ) = s and E(A^) = s 2 , and by Lemma 11, := Cov(A J 4' i ) = cr-P^ and 
D' 2 : = Cov(A^) = c(r 2 - 5 2 )P T2 , where c := l/(d + 2). As in (39), we have 

S(s) = a'-Di + (1 - a')£> 2 + a'(l - a')(« " s 2)(s - s 2 ) T , 

where 

volK) 



vol^+vol^) 
Since |o/(l — a') — a(l — a)| < \a! — a\, we have 

||C(a)-S( S )|| < \a'-a\(\\D' 1 \\ + \\D' 2 \\ + \\s-s 2 \\ 2 ) 

+ a||Ui - + (1 - a)||£> 2 - £> 2 || + «(1 - aMH/^ - s|| + ||/x 2 - s 2 | 



< (2c + l)r 2 |a' 



a 



+ \\D 1 - D[\\ V \\D 2 - D' 2 \\ + 2r(||A*i " 4 V \\fi 2 - s 2 \\), 

using the triangle inequality multiple times, and in the first inequality we used the fact that 

\\vv T — «;«; T || < || (u — ii>)i> T || + \\w(v — w) T \\ < (\\v\\ + ||«;||)||« — io||, 

for any two vectors v, w 6 Mr. Assuming that kt < I/C13, by Lemma 13, we have \\fii — s\\ V \\/j, 2 — 
s 2 || < Cisnr 2 and \\D\ — D[\\ V \\D 2 — D 2 \\ < Ci^Kr 3 . Assuming that nr < 1/3, P-jT 1 is well-defined 
and (1 + Kr)-Lipschitz on SkCiB r . And being an orthogonal projection, Pp k is 1-Lipschitz . Hence, 
applying Lemma 4, we have 

vol(A k ) 

1 < rr-r A < l + «r, k = 1,2. 
vol(P Tk (A k )) 



Then by Lemma 7, 



l-CrKr< yol ^jff <l + C 7K r, k = 1,2. 
vol(A') 



So we get 

l-Cr<^<l|C r , . = 1,2. 
vol(A k ) 



Since for all a, b, a , b > we have 



a + b a' + b' 



\a-a>\V\b-b>\ 
~ (a + b) V (a' + 6') 1 J 

< |l-o/o'| V|l-6/6'|, 
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we get 

\a — a'\ < Cr. 

Hence, 

\\C(s)-V(s)\\<Cr 3 , 

so we are done with the proof. □ 
6.2 Performance guarantees for Algorithm 2 

We deal with the case where there is no noise, that is, r = in (11), so that the data points are 
si, . . . , sjv, sampled exactly on Si U S2 according to the uniform distribution. We explain how 
things change when there is noise, meaning r > 0, in Section 6.4. 

Let Hj = {j 7^ i : Sj G N r (si)}, with (random) cardinality N = |Hj|. When there is no noise, 
Cj is the sample covariance of {sj : j G Hi}. For i G [n], let K \ = 1 if Sj G Si and = 2 otherwise, 
and let T\ = Ts K _(si), which is the tangent subspace associated with data point Sj. Given N, 
{sj : j G Ei} are uniformly distributed on S^ n B(si,r), and applying Lemma 14 with rescaling, 
we get that for any t > 

PfllCi-ECiH >r 2 t|iV i ) <4dexp f-^min (t, ^) J , 

for an absolute constant C14 > 1. We may assume that r < 1/(Cqk) and let n+ := nr d /C$. We 
assume throughout that r is large enough that > d, for otherwise the result is void since the 
probability lower bound stated in Theorem 1 is negative. Using Lemma 9, for any t < 1, 

PfllCi-ECiH >r 2 t) < F(\\C i -EC l \\>r 2 t\N i >n i )+F(N i <n i ,) 

< 4dexp(— n+^/Cn) + Cgn exp(— n*) 

< (Ad + C 9 )nexp(-n^ 2 /Ci 4 ). 

Define Sj as the covariance of the uniform distribution on Ti n B(si,r). Let 

/* = {»: ^ = ^, Vj G HJ, 

or equivalently, 

1° = {i : 3j s.t. Kj ^ Ki and ||sj — Sj|| < r}. 

By definition, indexes the points whose neighborhoods do not contain points from the other 
cluster. Applying Lemma 13, this leads to 

\\Ed - Sid < Ci 3 nr 3 , Vi € I*. (42) 

Define the events 

2 

fil = |J {Vs £S k : #{i:K i = k and * G B(s,r/C n )} > n*}, 
fe=l 

where Cq := 100d 2 C 2 , and 

U 2 = {\\Ci-ECi\\ < r 2 t, for all i G [n]} , 
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and their intersection f2 = S7i n O2, where t < 1 will be determined later. Note that, under 
A^i > n*. Applying the union bound, 



CI 



P(f2 c ) < P(J2f)+P(n: 

< Cgn exp(— n*) + n(4d + Cg) exp(— n+t 2 / C14) 

< pn := (4d + 2C 9 )nexp(-n^ 2 /Ci4)- 

Assuming that f2 holds, by the triangle inequality, (54) and (42), we have 

||C f -Ei|| < \\C i -EC i \\ + \\EC i -'Z i \\ <Cr 2 , Vie/*, (43) 

where 

C := t + C 13 Kr. (44) 
The inequality (43) leads, via the triangle inequality, to the decisive bound 

\\Ci-Cj\\ < ||E i -S j ||+2Cr 2 , (45) 

Take i,j £ I* such that Ki = Kj and [|sj — < e. Then by Lemma 11 and Lemma 15, 
property (20) and the fact that sin(2#) < 2sin# for all 9, and the triangle inequality, we have 

^||Ei - Sj-H = Kne max (T i ,r i ) < 2«||a< - ^|| < 2ke, (46) 
where c := l/(d + 2). This implies that 

-4 1| Cj - Cj-|| < 2cne + 2(. (47) 

Therefore, if 77 > 2c«e + 2£, then any pair of points indexed by i,j E I* from the same cluster and 
within distance e are direct neighbors in the graph built by Algorithm 2. 
Take i,j G such that Kj 7^ itj and ||sj — Sj\\ < e. By Lemma 18, 

max [ dist(sj, S± n £2), dist(sj, Si H S 2 )] < Cis||si — Sj\\. 

Let 2 be the mid-point of Si and Sj. By convexity and the display above, 

dist(z, Si n S 2 ) < - dist(si, Si n 5 2 ) + - distfo, £1 n S 2 ) < Ci 8 e. 

Assuming Ci8£ < let s = P^nSa ( z )- Then, by the triangle inequality again, 

1„ „ 1 



max 



\s - Sj||, ||s - 8 S \\\ < dist(z, Si n S 2 ) + 7y\\ s i- s j\\ < c i8£ +„£< (C 18 + l)e. 



Let denote the tangent subspace of Sft^ at s and let 5^ be the covariance of the uniform 
distribution over T[ n B(s,r). Define Tj and H'j similarly. Then, as in (46) we have 

— 2 ||£i - E-ll < K\\si - s\\ < k(Ci 8 + l)e, 
cr z 

and similarly, 

-L\\V 3 -m\<K(C l8 + l)e. 
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Moreover, by Lemma 11 and Lemma 15, 

1 



2 n S- - SjfH = sin 9 max (Tl,Tj) > sin9 s , 
where 9 S is short for 6(Si, S 2 ). Hence, by the triangle inequality, 

^2 - Sj|| > sin# s - 2k(Ci 8 + l)e, (48) 

and then 

-^\\Ci - Cj\\ > csm9 s - 2ck(Ci 8 + l)e - 2Q. (49) 

Therefore, if rj < csin# s — 2ck(C\s + l)e — 2(, then any pair of points indexed by i, j £ I* from 
different clusters are not direct neighbors in the graph built by Algorithm 2. 
In summary, we would like to choose 77 such that 

2cks + 2( < r] < csm9 s - 2ck(Ci 8 + l)e - 2(. 

This holds when 

c sin 9 3 
2cKe + 2(<r l <^—^ 2 , 

(d + 2)n ?7 ri sin#g 

ok b DO13K (G18 + 2){d + 2) 

using the definition of £ in (44) and that of c = l/(d + 2). We choose t = 77/6 and get that 
P(f2 c ) < Cnex.p(—nr d 7] 2 /C), where C depends only on d and C9. 



which is true when 



6.2.1 The different clusters are in different connected components 

We show that Step 3 in Algorithm 2 eliminates all points i ^ I±, implying by our choice of parameters 
in (50) that after that step the two clusters are not connected to each other in the graph. Hence, 
take i $ I* with K \ = 1 (say), so that dist(sj, S 2 ) < r. By Lemma 18, we have dist(sj, S\ n S 2 ) < 
Ci 8 r < 1/k. Assuming that (Ci 8 + l)r < 1/k, let s° = Ps 1 ns 2 ( s i) an d define = T Sk (s°) 

Below, C > is a constant whose value increases with each appearance. By Lemma 22 (and 
the notation there), for s G Si such that dist(s, S 2 ) < (C\s + 1)7", 

\\C(s) - £(a)|| < Cr 3 . 

We now derive another approximation that involves S°(s), the covariance matrix of the uniform 
distribution on B(P T o(s),r) n (T® U T®). For that, we continue with the notation used in the 
proof of Lemma 22 until (52) below. Define ii = P T o(s) and t 2 = ^r°(^l)- Let ^0 = Pi ~~ 
Si = \\s — ti\\, 5 2 = \\s2-t 2 \\ and A° k = T^C\B(ti,r). By Lemma 1, we have 5i < Cr 2 and S 2 < Cr 2 , 
because \\s — s°\\ < Cr by Lemma 18, and 

11^2 — < \\s 2 — s\\ + \\s — S°\\. 

Hence, \Sq — S\ < 5i + 5 2 < Cr 2 . We assume that r is small enoug h that Cr 2 < r, so that A\ / 0. 
Note that E(A A o) = t k and D? := Cov(A A o) = cr 2 P T o, while D§ := Cov(A A o) = c(r 2 - 6%)P T o 
when 5o < r; otherwise = 0. As in (39), we have 

E°(a) = a°£>? + (1 - a°)£>2 + a°(l - a°)(ti - t 2 )(*i - * 2 ) T , (51) 
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where 



,o 



vo 



ry 

' vol(A§) + vol(A°) ' 



This identity remains valid even when A® = 0. As in the proof of Lemma 22, we have 

||S(s) - S°(a)|| < (2c + l)r 2 \a' - a°\ + \\D[ - D?|| V ||£>' 2 - + 2r(||*i - s\\ V ||t 2 - s 2 ||). 
By the triangle inequality and the fact that \\Pr\\ < 1 for any subspace T, 

\\D , 1 -D 1 \\<cr 2 \\Pr 1 -P J? \\ i 

and 

\\D' 2 - D° 2 \\ < cr 2 \\P T2 - P T o\\ + c\5 2 - 5 2 \. 
By Lemma 2 and Lemma 15, we have 

\\ P T! - P T °\\ < K \\ s - s °\\ < Cr i \\ P T 2 ~ P T°\\ < K \\ s 2 ~ < Cr. 

And since \5 2 - 5 2 \ < 2r\5 - S \ < Cr 3 , we have \\D' k - D° k \\ < Cr 3 for k = 1,2. Let uj d denote the 
volume of the (i-dimensional unit ball. Then 



vol(^) = oo d r d , vo\(A' 2 ) = u d (r 2 - 5 2 ) d / 2 , vo\{A\) = u d r 2 , vol(A°) = u d (r 2 - 6 2 ) d / 2 , 
so that 



\a — a°\ 



I 1 + (1 _ £ /r )d/2 1+{1 _ So/r) d/2\ 

< \ { i-S/r) d / 2 -(l-5 /r)f\. 
Proceeding exactly as when we bounded £ in the proof of Lemma 19, we get 



\a - a°\ < dy/\5-6 \/r < Cy/r. 

Hence, we proved that 

||S(s)-S (s)|| <Cr 5 / 2 . 
We conclude with the triangle inequality that 

\\C(s) - E°(,)[| < \\C(s) - E(*)|| + ||S( S ) - S°( S )|| < Cr 5 / 2 . (52) 

Again, this holds for any sgSi such that \\s — s°\\ < (Cis + l)r. 

Assuming that Si ^ s° (which is true with probability one) and s° = 0, let h = \\si — s°\\ and 
v = (si - s°)/h. Note that Sj = hv. Because v J_ Tf f] T 2 °, and that 6 min (T? ,T$) > S , we apply 
Lemma 20 with scaling to find t G h ± r/2 such that - > r 2 C 2 o, where C 2 o > 

depends only on 6 S and d. Letting s = £v, we have ||s — Si\\ = \h — £\ < r/2, so that 

dist(s, Si n 5 2 ) < dist(«i,5i n 5 2 ) + r/2 < (C 18 + l/2)r < 1/k, 

and consequently, Ps in s 2 (s) = s°, by (Federer, 1959, Th 4.8(12)). Hence, by the triangle inequality, 

\\C( Si ) - C(i)|| > - E°(i)|| - [|C(*) - 53°(*OII " " 

> r 2 /C 20 ~ 2Cr 5 / 2 . 
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We apply the same arguments but now coupled with Lemma 19 after applying a proper scaling, to 
get 

||C(5)-C(S)|| < \\^(-s)-^ (s)\\ + \\C(-s)-^°(s)\\ + \\C(s)-i: Cs)\\ 
< 5dr 3 / 2 y/\\s-s\\ +2Cr 5 / 2 , 

for any s G Si such that \\s — s\\ < r/2, since this implies that \\s — s \\ < (Cig + l)r. Hence, when 
\\s - i|| < r/C n and r 1 / 2 < 1/(16CC 20 ), we have 

\\C(s)-C(si)\\ >r 2 (l/C 20 - M^/l/C^ -Wr 1 / 2 ) > r 2 /(4C 20 ). 

Now, under Q, there is Sj G Si n B(s,r/Cn), so that \\Cj — Ci\\ > r 2 /(4C 2 o). Therefore, choosing 
77 such that 77 < 1/(4C2q), we see that \\Cj — Cj|| > 77, even though \\sj — Sj|| < r/2 + r/Cfo < r. 

6.2.2 Each cluster forms a connected component in the graph 

We show that the points that survive Step 3 and belong to the same cluster form a connected 
component in the graph, except for possible spurious points near the intersection. Take, for example, 
the cluster generated from sampling S%. The danger is that Step 3 created a "crevice" within this 
cluster wide enough to disconnect it. We show this is not the case. (Note that the crevice may be 
made of several disconnected pieces.) Before we start, we recall that I£ was eliminated in Step 3, 
so that by our choice of 77 in (50), to show that two points Sj, s,- sampled from Si are neighbors it 
suffices to show that ||s, — Sj\\ < e. 

We first bound the width of the crevice. Let I Q = {i G I* : 3, C By our choice of parameters 
in (50), we see that i G I Q is neighbor with any j G Hi, so that i survives Step 3. Hence, the nodes 
removed at Step 3 are in 1^ := {i : Hj Pi I£ 7^ 0}, with the possibility that some nodes in survive. 
Now, for any i ^ I±, there is j with Kj 7^ K{ such that ||sj — Sj\\ < r, so by Lemma 18, 

dist(sj,Si n S 2 ) < Cis\\si - Sj\\ < Ci$r. 

By the triangle inequality, this implies that dist(sj,Si n S 2 ) < ri := (Cis + l)r for all i G J|. So 
the crevice is along Si n S 2 and of width bounded by n . We will require that e is sufficiently larger 
than r\ , which intuitive will ensure that the crevice is not wide enough to disconnect the subgraph 
corresponding to I Q . Let R := Si \ B(S\ n S 2 ,r 2 ), r 2 = v\ + r = (Cis + 2)r, so that any Sj G Si 
such that dist(sj, R) < r survives Step 3. 

Take two adjacent connected components of R, denoted R\ and i? 2 . We show that there is at 
least one pair ji,j 2 of direct neighbors in the graph such that Sj m G R m . Take s on the connected 
component of Si n S 2 separating i?i and i? 2 - Let T k = Ts k (s) and let H be the affine subspace 
parallel to (T 1 n T 2 ) 1 - passing through s. Take t m G P T i(R m ) n H n 9B(s,ei), where £1 := e/2, 
and define s m = P"^*™)- Note that here t 1 ,* 2 G T 1 and s 1 , s 2 G Si, and by (Federer, 1959, Th 
4.8(12)), P5 1 ns , 2 (t m ) = s. Lemma 3 not only justifies this construction when ke\ < 1/3, it also 
says that has Lipschitz constant bounded by 1 + reei, which implies that 

\\s m - s|| < (1 + /t£i)||t m - s|| = (1 + KEi) ei < e/3, 

when e is sufficiently small. We also have 

dist(s m ,SinS 2 ) > dist(t m , Si n S 2 ) - ||s m -t m || 



\t m -s\ 

K , 



2 



> (1- ^(1 + Kei) 2 ei)ei 

> e/3, 
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when e is sufficiently small. We used (16) in the second inequality. We assume r/e is sufficiently 
small that e/3 > ri + r, Then under Oi, there are ji, j'2 such that Sj m € B(s m ,r) n Si. By the 
triangle inequality, we then have that dist(s Jm , Si n S2) > s/S — r > r2, so that s Jm G i? m , and 

1 1 ^ j\ *^J*2 1 1 1 1 ^ j\ ^ 1 1 "I - 1 1 *^ ^ 1 1 ~~ 1 1 ^ *^ 1 1 "I - 1 1 ^ 1 1 

< r + e/3 + e/3 + r 
2 

= -e + 2r<e, 

because 6r < 3(r + n) < e by assumption. 

Now, we show that the points sampled from R\ form a connected subgraph. (i?i is any connected 
component of R.) Take s 1 , . . . , s M an r-packing of R%, so that 

(J(i?! n B(s m , r/2)) C i?i C (J(i?! n B( S m , r)). 

m m 

Because R\ is connected, L) m B(s m , r) is necessarily connected. Under Oi, and C22 > 2, all the 
balls B(s m , r),m = 1, . . . , M, contain at least one Si G Si, and any such point survives Step 3 since 
dist(sj,i?i) < r by the triangle inequality. Two points Sj and Sj such that Sj,Sj G B(s m ,r) are 
connected, since ||sj — < 2r < e. And when B(s mi ,r) n B(s m2 ,r) ^ 0, Sj G B(s mi ,r) and 
Sj G B(s m2 , r) are such that — Sj|| < 4r < e. Hence, the points sampled from i?i are connected 
in the graph under 

We conclude that the nodes corresponding to R that survive Step 3 are connected in the graph. 
6.2.3 Choice of parameters 

Aside from the constraints displayed in (50), we assumed in addition that 

and that e was sufficiently small. Therefore, it suffices to choose the parameters as in (13) with a 
large-enough constant. 

6.3 Performance guarantees for Algorithm 3 

We keep the same notation and go a little faster here as the arguments are parallel. Let d% denote 
the estimated dimensionality at point Sj, meaning the number of eigenvalues of C% exceeding 
■y/rj \\Ci\\. Recall that Qi denotes the orthogonal projection onto the top di eigenvectors of Cj. The 
arguments hinge on showing that, under 0, di = d for all i G J* and that > d for i such that 
dist(sj, Si n S2) < r/C, for some constant C > 0. 

Take i G I*. Under $7, (43) holds, and applying Weyl's inequality (Stewart and Sun, 1990, 
Cor. IV. 4. 9), we have 

\Pm{Ci)-MSi)\ <(r 2 , Vm = l,...,£>. 

By Lemma 11, Sj = cr 2 P^, so that /3 m (Slj) = cr 2 when m < d and /3 m (Ej) = when m > d. 
Hence, 

/8i(C<) < (c + C)r 2 , /3 d (C<) > (c - C)r 2 , /WC*) < (r 2 . 
This implies that 
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when C < r//2 as in (50) and r/ is sufficiently small. When this is so, di = d by definition of d{. 

Note that the top d eigenvectors of generate T{. Hence, applying the Davis-Kahan theorem, 
stated in Lemma 16, and (43) again, we get that 

||Q i - j PT l ||<^^ = C':=v / 2(d + 2)C, Vie I*. 
This is the equivalent of (43), which leads to the equivalent of (45): 

using the fact that = cr 2 P^. When i,j G are such that iQ = Kj, based on (46), we have 

\\Qi ~ Qj\\ < 2ke + 2('. 

Hence, when r] > 2ke + 2£', two nodes i,j G /* such that Ki = Kj and ||sj — s 3 -[| < e are neighbors 
in the graph. The arguments provided in Section 6.2.2 now apply in exactly the same way to show 
that nodes i G 1* such that K^ = 1 belong to a single connected component in the graph, except 
for possible nodes near the intersection. The same is true of nodes i G I* such that Ki = 2. 

Therefore, it remains to show that these two sets of nodes are not connected. When we take 
i,j G K such that Ki ^ Kj, we have the equivalent of (49), meaning, 

WQi - Qj\\ > sm6 s - 2k(C 18 + l)e - 2(' . 

We choose rj smaller than the RHS, so that these nodes are not neighbors in the graph. 

We next prove that a node i G /* is not neighbor to a node near the intersection because of 
different estimates for the local dimension. Take s G Si such that 5(s) := dist(s, 52) < r. We apply 
Lemma 22 and use the notation there until (53) below, with the exception that we use s 2 = Ps 2 (s) 
and T 2 (i) to denote Ts 2 {s 2 ). Together with Weyl's inequality, we have 

P d+l (C(s)) > /WS(s)) " C 22 r 3 , Pi(C(s)) < + C 22 r 3 , 

which together with Lemma 21 (and proper scaling), implies that 

/WC(fl)) > I(l-cosg max m^ 2(1) )) 2 (l-(^)/r) 2 )f +1 -C 22 r 3 
Pi(C(s)) ~ c+ (5(s)/r)(l - (5(s)/r) 2 )+ /2 + C 22 r 3 

Define s°, T° and T® as in Section 6.2.1. Then, by the triangle inequality, 

8max(Tl, T 2 (ij) > ^ max (T{ ) ,T 2 ) — 9 m ax(Tl, if) — max (T 2 (i), T 2 ). 

By definition, max (T{ ) ,T 2 ) > 9 S , and by Lemma 2, 

flmaxCTi,!?) < 2asin (l A ^\\s - «°||) < Cr, 

and similarly, 

^max(T 2(1) ,T 2 ) < 2asin (l A ^||s 2 - < Cr, 

because ||s — s°\\ < Cr and ||s 2 — < r + \\s — s°\\ < Cr. Hence, for r small enough, 
^max(7i,T 2 ( 1 )) > 6 s /2, and furthermore, 



Pi(C(s)) — ~ r 2 



when i_^!>^/(d+2) ; (53) 
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where 

9(l + l/cyr? 
(l-cos(# s /2)) 2 ' 

by the fact that rj > r in (50). The same is true for points on s £ S 2 if we redefine S(s) = dist(s, Si). 
Hence, for close enough to the intersection that 5(si) satisfies (53), d{ > d. Then, by Lemma 15, 
WQi ~ = 1 f° r an y j ^ !*• By our choice of 77 < 1, this means that i and j are not neighbors. 

So the only way {i £ I* : Ki = 1} and {i £ 1+ : Ki = 2} are connected in the graph is if there 
are Sj 6 Si and Sj E S2 such that ||sj — Sj\\ < e and both 5(si) and fail to satisfy (53). We 

now show this is not possible. By Lemma 22, we have 

HCi-SiH <C 22 r 3 . 

By (39) (and using the corresponding notation) and the triangle inequality 

-a iC r 2 P Tx \\ < c{l-a i )t 2 i r 2 + a i {l-a i )5 2 {s i )<2{l-a i )r 2 
< 2(1 - (S( Sl )/r) 2 ) d J 2 r 2 < 2^ d /^r 2 , 

where the very last inequality comes from <5(sj) not satisfying (53). Hence, 

\\C l -a l cr 2 P Ti \\<2^ d+2 \ 2 + C 22 r\ 

and since the (3d+i{PTi) = 0, by the Davis-Kahan theorem, we have 

\\Qi-P Ti \\ < -^k d/(d+2) r 2 + C 22 r 3 ] <C{e l{d+2) +r), 
a>icr z 1 J 

and similarly, 

WQj-PtjW <C(^ d+2 ^ + r). 

By Lemma 15, ||P^ — Py. || = sin # max (?i, Tj). Let s° = -P5 1 nS 2 ( s i)> an d define T® and as before. 
We have 

dmsx(Ti, Tj) > 9 max (T^ , T 2 ) — mS/X (Ti, Ti) — max {Tj, T 2 ) > 0$ — Ce, 

calling in Lemma 2 as before, coupled with the fact that ||sj — s°|| < Ce and [|s» — s°|| < Ce, since 
dist(sj,5 < 2) < \\si — Sj\\ < e and Lemma 18 applies, and then \\sj — s°\\ < \\si — s°\\ + \\sj — Si\\. 
Hence, assuming e is small enough, 

WQi-QjW > WPri-PrjW-WQi-PriW-WQj-PrjW 
> sm(9 s /2)-C(C d/(d+2) + r) > rj, 

when r and ij (and therefore £) are small enough. Therefore i and j are not neighbors, as we needed 
to show. 

We conclude by remarking that, by choosing C large enough in (13), the resulting choice of 
parameters fits all our (often implicit) requirements. 

6.4 Noisy case 

So far we only dealt with the case where r = in (11). When r > 0, a sample point Xi is in general 
different than its corresponding point Si sampled from one of the surfaces. However, when r/r is 
small, this does not change things much. For one thing, the points are close to each other, since we 
have \\xi — Sj|| < r by assumption, and r is small compared to r. And the corresponding covariance 
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matrices are also close to each other. To see this, redefine Sj = {j 7^ i : Xj G A^(a;j)} and Cj as 
the sample covariance of {ajj : j G 3j}. Let -Dj denote the sample covariance of {sj : j £ Hj}. Let 
X be uniform over {ajj : j G HJ and define V = ^ • Sjll{x=:c,}- As in (24), we have 

\\Di-Ci\\ = ||Cov(X)-Cov(F)|| 

< E [\\X — Yf} 1/2 ■ (E [\\X - Xi f} 1/2 + E [\\Y — Xi f} 1/2 ) 

< t ■ (r + r + t) = r 2 (2r/r + (r/r) 2 ), (54) 

which is small compared to r 2 , which is the operating scale for covariance matrices in our setting. 

Using these facts, the arguments are virtually the same, except for some additional terms due 
to triangle inequalities, for example, ||sj — 8j\\ — 2r < ||a3j — Xj\\ < \\si — Sj\\ +2r. In particular, this 
results in £ in (44) being now redefined as ( = ^ + t + C\^kt. We omit further technical details. 
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